**Overview**
This PR switches the order of freeing the unsharded `FlatParameter` (`self._free_unsharded_flat_param()`) and switching to use the sharded `FlatParameter` (`self._use_sharded_flat_param()`). This is to prevent "use-after-free"-type bugs where, for `param.data = new_data`, `param` has its metadata intact but not its storage, causing an illegal memory access for any instrumentation that depends on its storage. (`param` is an original parameter and `new_data` is either a view into the sharded `FlatParameter` or `torch.empty(0)` depending on the sharding and rank.)
**Details**
To see why simply switching the order of the two calls is safe, let us examine the calls themselves:
652457b1b7/torch/distributed/fsdp/flat_param.py (L1312-L1339)
652457b1b7/torch/distributed/fsdp/flat_param.py (L1298-L1310)
- `_free_unsharded_flat_param()` does not make any assumption that `self.flat_param`'s data is the sharded `FlatParameter` (i.e. `_local_shard`).
- The sharded `FlatParameter` (i.e. `_local_shard`) is always present in memory, which means that FSDP can use sharded views at any time, including before freeing the unsharded data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94859
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
Fixes #94353
This PR adds examples and further info to the in-place and out-of-place masked scatter functions' documentation, according to what was proposed in the linked issue. Looking forward to any suggested changes you may have as I continue to familiarize myself with PyTorch 🙂
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94545
Approved by: https://github.com/lezcano
Add triton support for ROCm builds of PyTorch.
* Enables inductor and dynamo when ROCm is detected
* Adds support for pytorch-triton-mlir backend
* Adds check_rocm support for verify_dynamo.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94660
Approved by: https://github.com/malfet
With the release of ROCm 5.3, HIP now supports a hipGraph implementation.
All necessary backend work and hipification is done to support the same functionality as cudaGraph.
Unit tests are modified to support a new TEST_GRAPH feature, which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
Following the same logic of preloading cudnn and cublas from the pypi folder in multi-arch distributions, where Pure-lib vs. Plat-lib matters, this PR adds the same logic for the rest of the CUDA pypi libraries that were integrated.
I have tested this PR by running the code block locally and installing/uninstalling nvidia pypi libraries:
```
import sys
import os
def _preload_cuda_deps():
    """Preloads cudnn/cublas deps if they could not be found otherwise."""
    # Should only be called on Linux if default path resolution have failed
    cuda_libs = {
        'cublas': 'libcublas.so.11',
        'cudnn': 'libcudnn.so.8',
        'cuda_nvrtc': 'libnvrtc.so.11.2',
        'cuda_runtime': 'libcudart.so.11.0',
        'cuda_cupti': 'libcupti.so.11.7',
        'cufft': 'libcufft.so.10',
        'curand': 'libcurand.so.10',
        'cusolver': 'libcusolver.so.11',
        'cusparse': 'libcusparse.so.11',
        'nccl': 'libnccl.so.2',
        'nvtx': 'libnvToolsExt.so.1',
    }
    cuda_libs_paths = {lib_folder: None for lib_folder in cuda_libs.keys()}
    for path in sys.path:
        nvidia_path = os.path.join(path, 'nvidia')
        if not os.path.exists(nvidia_path):
            continue
        for lib_folder, lib_name in cuda_libs.items():
            candidate_path = os.path.join(nvidia_path, lib_folder, 'lib', lib_name)
            if os.path.exists(candidate_path) and not cuda_libs_paths[lib_folder]:
                cuda_libs_paths[lib_folder] = candidate_path
        if all(cuda_libs_paths.values()):
            break
    if not all(cuda_libs_paths.values()):
        none_libs = [lib for lib in cuda_libs_paths if not cuda_libs_paths[lib]]
        raise ValueError(f"{', '.join(none_libs)} not found in the system path {sys.path}")


_preload_cuda_deps()
```
I don't have access to a multi-arch environment, so if somebody could verify a wheel with this patch on a multi-arch distribution, that would be great!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94355
Approved by: https://github.com/atalman
If the input to `operator.not_` is a tensor, I want to convert the operator to `torch.logical_not`. This allows the following test case to pass; beforehand it resulted in the error `NotImplementedError("local_scalar_dense/item NYI for torch.bool")`.
```
def test_export_tensor_bool_not(self):
    def true_fn(x, y):
        return x + y

    def false_fn(x, y):
        return x - y

    def f(x, y):
        return cond(not torch.any(x), true_fn, false_fn, [x, y])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94626
Approved by: https://github.com/voznesenskym
Fixes#91824
This PR adds a new dynamo backend registration mechanism through ``entry_points``. The ``entry_points`` of a package provides a way for the package to register a plugin for another one.
The docs of the new mechanism:

(the typo '...named "my_backend" that has been...' has been fixed to '...named "my_compiler" that has been...')
# Discussion
## About the test
I did not add a test for this PR, as it is hard either to install a fake package during a test or to manually hack the entry points function by replacing it with a fake one. I have tested this PR offline with the hidet compiler and it works fine. Please let me know if you have any good ideas for testing this PR.
## About the dependency of ``importlib_metadata``
This PR adds a dependency on ``importlib_metadata`` for Python < 3.10, because the modern usage of ``importlib`` only became stable at that Python version (see the documentation of the importlib package [here](https://docs.python.org/3/library/importlib.html)). For Python < 3.10, the ``importlib_metadata`` package implements the same feature. The current PR will hint the user to install ``importlib_metadata`` if their Python version is < 3.10.
## About the name and docs
Please let me know what you think of the name ``torch_dynamo_backend`` as the entry point group name and of the documentation of this registration mechanism.
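For illustration only, here is a minimal sketch of how a third-party package might expose a backend through such a mechanism. The package name, module layout, and callable are made up; the entry point group name simply follows the ``torch_dynamo_backend`` name discussed above and may differ in the final implementation.
```python
# setup.py of a hypothetical package exposing a dynamo backend via entry points.
# All names here are illustrative assumptions, not the actual registered API.
from setuptools import setup

setup(
    name="my_compiler_pkg",
    version="0.1.0",
    py_modules=["my_compiler_pkg"],
    entry_points={
        "torch_dynamo_backend": [
            # "<backend name> = <module>:<backend callable>"
            "my_compiler = my_compiler_pkg:my_compiler_backend",
        ],
    },
)
```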
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93873
Approved by: https://github.com/malfet, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/93890
We do the following:
1. fix the `__init__` constructor for `AutocastModeVariable` with an existing `mode` while copying
2. `resume_execution` is made aware of constant args (`target_values`) by storing said args in `ReenterWith`. To propagate between subgraphs (in straightline code), we also store the constant args in the downstream's `code_options["co_consts"]` if they are not already there.
---
Future work:
1. handle instantiating context manager in non-inlineable functions. Simultaneously fix nested grad mode bug.
2. generalize to general `ContextManager`s
3. generalize to variable arguments passed to context manager, with guards around the variable.
---
Actually, if we look at the repro: 74592a43d0/test/dynamo/test_repros.py (L1249), we can see that the method in this PR doesn't work for graph breaks in function calls, in particular, in function calls that don't get inlined.
Why inlining functions with graph breaks is hard:
- When we handle graph breaks, we create a new code object for the remainder of the code. It's hard to imagine doing this when you are inside a function, then we need a frame stack. And we just want to deal with the current frame as a sequence of straight line codes.
Why propagating context manager information is hard:
- If we do not inline the function, the frame does not contain any information about the parent `block_stack` or `co_consts`. So we cannot store it on local objects like the eval frame. It has to be a global object in the output_graph.
---
Anyway, I'm starting to see clearly that dynamo must indeed be optimized for torch use-case. Supporting more general cases tends to run into endless corner-cases and caveats.
One direction that I see as viable to handle function calls which have graph breaks and `has_tensor_in_frame` is to stick with not inlining them, while installing a global `ContextManagerManager`, similar to the `CleanupManager` (which cleans up global variables). We can know which context managers are active at any given point, so we can install their setup/teardown code on those functions and their fragments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94137
Approved by: https://github.com/yanboliang
The `requirements.txt` file is in the PyTorch directory. The instructions to `clone` and `cd` to the PyTorch directory are in a later section, under "Get the PyTorch Source". So, following the instructions as written gives an error that `requirements.txt` is not found.
```ERROR: Could not open requirements file: .. No such file or directory: 'requirements.txt' ```
This PR clarifies the usage of the command.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94155
Approved by: https://github.com/malfet
GraphModules that were created during DDPOptimizer graph breaking
lacked `compile_subgraph_reason`, which caused an exception when
running .explain().
Now the reason is provided and users can use .explain() to find out
that DDPOptimizer is causing graph breaks.
Fixes #94579
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94749
Approved by: https://github.com/voznesenskym
This restructures the magic methods so that there is a stub `add` that calls the metaprogrammed `_add`. With this change, `SymNode.add` can now show up in stack traces, which is a huge benefit for profiling.
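As an illustration only (not the actual SymNode internals), the stub pattern looks roughly like this:
```python
# Minimal sketch of the stub-wrapping pattern: the public method is a real
# `def` that forwards to the metaprogrammed implementation, so it shows up
# by name in stack traces and profiler output.
class SymNodeSketch:
    def _add(self, other):
        # the metaprogrammed/generated implementation would live here
        raise NotImplementedError

    def add(self, other):
        # thin named stub; appears as "add" in tracebacks instead of a generic wrapper
        return self._add(other)
```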
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94410
Approved by: https://github.com/Chillee
Changes:
* Add `simplified` kwarg to let you only render guards that are nontrivial (excludes duck sizing)
* Make a list of strings valid for sources, if you just have some variable names you want to bind to
* Add test helper `show_guards` using these facilities, switch a few tests to it
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94404
Approved by: https://github.com/Chillee
This patch started with only the change in `torch/_prims_common/__init__.py`. Unfortunately, this change by itself fails tests. The reason it fails tests is that sym_max produces a sympy.Max expression, which impedes our ability to actually reason symbolically about the resulting expressions. We much prefer to insert a guard on `l > 1` and get a Sympy expression without Max in it, if we can. In the upcoming unbacked SymInts PR, we can't necessarily do this, but without unbacked SymInts, we always can.
To do this, we introduce `alternate_impl_if_hinted_methods`. The idea is that if all of the arguments into max/min have hints, we will just go ahead and introduce a guard and then return one argument or the other, depending on the result. This is done by rewrapping the SymNode into SymInt/SymFloat and then running builtins.min/max, but we also could have just manually done the guarding (see also https://github.com/pytorch/pytorch/pull/94365 )
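A rough sketch of the idea, with made-up helper names (not the real SymNode API):
```python
# If both sides have hints, guard on the comparison and return one argument
# directly; otherwise fall back to a symbolic Max (the unbacked-SymInt case).
import sympy

def sym_max_sketch(a, b, has_hint, guard_eval):
    if has_hint(a) and has_hint(b):
        # introduce a guard and burn in the chosen branch
        return a if guard_eval(a >= b) else b
    return sympy.Max(a, b)

# e.g. sym_max_sketch(l, 1, ...) guards on "l >= 1" instead of producing Max(l, 1)
```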
However, a very subtle problem emerges when you do this. When we do builtins.min/max, we return the argument SymNode directly, without actually allocating a fresh SymNode. Suppose we do a min/max with a constant (as is the case in `sym_max(l, 1)`). This means that we can return a constant SymNode as the result of the computation. Constant SymNodes get transformed into regular integers, which then subsequently trigger the assert at https://github.com/pytorch/pytorch/pull/94400/files#diff-03557db7303b8540f095b4f0d9cd2280e1f42f534f67d8695f756ec6c02d3ec7L620
After thinking about this a bit, I think the assert is wrong. It should be OK for SymNode methods to return constants. The reason the assert was originally added was that ProxyTensorMode cannot trace a constant return. But this is fine: if you return a constant, no tracing is necessary; you know you have enough guards that it is guaranteed to be a constant no matter what the input arguments are, so you can burn it in. You might also be wondering why a change to a SymNode method affects the assert from the dispatch mode dispatch: the call stack typically looks like SymNode.binary_magic_impl -> SymProxyTensorMode -> SymNode.binary_magic_impl again, so you hit binary_magic_impl twice!
No new tests, the use of sym_max breaks preexisting tests and then the rest of the PR makes the tests pass again.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94400
Approved by: https://github.com/Chillee
The expression `argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []` evaluates to the empty list when `TEST_SAVE_XML` is false, because the conditional expression binds more loosely than `+`; the intended behavior would need parentheses.
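The precedence issue can be reproduced in isolation (the variables below are stand-ins for the real ones):
```python
argv = ["test_foo.py"]
TEST_SAVE_XML = False
test_report_path = "report.xml"

# Parses as: (argv + [...]) if TEST_SAVE_XML else []  -> empty list when the flag is falsy
args = argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []
assert args == []

# The intended behavior needs parentheses around the conditional expression
args = argv + ([f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else [])
assert args == ["test_foo.py"]
```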
Instead simplify the code by appending the argument when required directly where `test_report_path` is set.
Note that `.append()` may not be used as that would modify `argv` and in turn `UNITTEST_ARGS` which might have undesired side effects.
Without this patch, `pytest.main()` would be called with no arguments, which tries to discover all tests in the current working directory and ultimately leads to (many) failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94589
Approved by: https://github.com/clee2000, https://github.com/Neilblaze
**Overview**
This refactors module materialization (i.e. meta device or `torchdistX` deferred initialization) to compute the parameter and buffer names as needed instead of pre-computing them. These are needed to reacquire references to the states (e.g. `module.get_parameter(param_name)`) after materialization since the materialization may create new variables.
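As a small standalone illustration (using `to_empty` as a stand-in for the materialization step, not the actual FSDP code path), references taken before materialization can become stale:
```python
import torch

lin = torch.nn.Linear(4, 4, device="meta")
before = lin.weight                   # reference taken before materialization

lin.to_empty(device="cpu")            # materialization may create new parameter variables

after = lin.get_parameter("weight")   # reacquire the reference by name
print(before is after)                # typically False: the old reference is stale
```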
This refactor simplifies `_get_fully_sharded_module_to_states()` (the core function for "pseudo auto wrapping") to better enable lowest common ancestor (LCA) module computation for shared parameters, for which tracking parameter and buffer names may complicate the already non-obvious implementation.
**Discussion**
The tradeoff is a worst case quadratic traversal over modules if materializing all of them. However, since (1) the number of modules is relatively small, (2) the computation per module in the quadratic traversal is negligible, (3) this runs only once per training session, and (4) module materialization targets truly large models, I think this tradeoff is tolerable.
**For Reviewers**
- `_init_param_handle_from_module()` initializes _one_ `FlatParamHandle` from a fully sharded module and represents the module wrapper code path. For this code path, there is no need to reacquire references to the parameters/buffers for now since the managed parameters are only computed after materialization. This works because the managed parameters have a simple definition: any parameter in the local root module's tree excluding those already marked as flattened by FSDP. Similarly, FSDP marks buffers to indicate that they have already been processed (synced if `sync_module_states`).
- `_init_param_handles_from_module()` initializes _all_ `FlatParamHandle`s from a fully sharded module and represents the composable code path. For this code path, we must reacquire references to parameters/buffers because each logical wrapping is specified as a list of parameters/buffers to group together by those variables and because materialization may create new variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94196
Approved by: https://github.com/rohan-varma
Hi!
I've been fuzzing different pytorch modules, and found a few crashes inside one of them.
Specifically, I'm talking about a module for interpreting the JIT code and a function called `InterpreterState::run()`. Running this function with the provided crash file results in a crash, which occurs while calling `dim()` on a `stack` with 0 elements ([line-686](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (L686))). The crash itself occurs later, when std::move is called with an incorrect value of type `IValue`.
The second crash is similar and occurs on [line 328](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (LL328C15-L328C48)), where `reg(inst.X + i - 1) = pop(stack);` is executed. The error here is the same, `Stack stack` might not contain enough elements.
The third crash occurs on [line 681](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (L681)). The problem here is the same as for previous crashes. There are not enough elements in the stack.
In addition to these places, there are many others (in the same function) where bounds checking is also missing. I am not sure what the best way to fix these problems is; however, I suggest adding a boundary check inside each of these case statements.
All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)
### How to reproduce
1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)
2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`
3. Copy these crash files to the current directory:
- [crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471.zip](https://github.com/pytorch/pytorch/files/10674143/crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471.zip)
- [crash-55384dd7c9689ed7b94ac6697cc43db4e0dd905a.zip](https://github.com/pytorch/pytorch/files/10674147/crash-55384dd7c9689ed7b94ac6697cc43db4e0dd905a.zip)
- [crash-06b6125d01c5f91fae112a1aa7dcc76d71b66576.zip](https://github.com/pytorch/pytorch/files/10674152/crash-06b6125d01c5f91fae112a1aa7dcc76d71b66576.zip)
4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``
5. And execute the binary: `/jit_differential_fuzz /homedir/crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471`
After execution completes you will see this stacktrace:
```asan
==36==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6060001657f8 at pc 0x00000060bc91 bp 0x7fff00b33380 sp 0x7fff00b33378
READ of size 4 at 0x6060001657f8 thread T0
#0 0x60bc90 in c10::IValue::IValue(c10::IValue&&) /pytorch_fuzz/torch/include/ATen/core/ivalue.h:214:43
#1 0xc20e7cd in torch::jit::pop(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/aten/src/ATen/core/stack.h:102:12
#2 0xc20e7cd in torch::jit::dim(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/mobile/promoted_prim_ops.cpp:119:20
#3 0xc893060 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:686:13
#4 0xc85c47b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:1010:9
#5 0x600598 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential_fuzz.cc:66:38
#6 0x601d99 in LLVMFuzzerTestOneInput /jit_differential_fuzz.cc:107:25
#7 0x52ccf1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#8 0x516c0c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#9 0x51c95b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#10 0x545ef2 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#11 0x7f9ec069a082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
#12 0x51152d in _start (/jit_differential_fuzz+0x51152d)
0x6060001657f8 is located 8 bytes to the left of 64-byte region [0x606000165800,0x606000165840)
allocated by thread T0 here:
#0 0x5fd42d in operator new(unsigned long) /llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0xa16ab5 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
#2 0xa168f1 in c10::IValue& std::vector<c10::IValue, std::allocator<c10::IValue> >::emplace_back<c10::IValue&>(c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4
#3 0xc89b53c in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:344:19
#4 0xc85c47b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:1010:9
#5 0x600598 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential_fuzz.cc:66:38
#6 0x601d99 in LLVMFuzzerTestOneInput /jit_differential_fuzz.cc:107:25
#7 0x52ccf1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#8 0x516c0c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#9 0x51c95b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#10 0x545ef2 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#11 0x7f9ec069a082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch_fuzz/torch/include/ATen/core/ivalue.h:214:43 in c10::IValue::IValue(c10::IValue&&)
Shadow bytes around the buggy address:
0x0c0c80024aa0: fd fd fd fd fd fd fd fa fa fa fa fa 00 00 00 00
0x0c0c80024ab0: 00 00 00 fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c0c80024ac0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
0x0c0c80024ad0: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
0x0c0c80024ae0: fd fd fd fd fa fa fa fa 00 00 00 00 00 00 00 00
=>0x0c0c80024af0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa[fa]
0x0c0c80024b00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c0c80024b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80024b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80024b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80024b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==36==ABORTING
```
6. Executing the remaining crashes gives similar crash reports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94298
Approved by: https://github.com/davidberard98
- To check for Memory Leaks in `test_mps.py`, set the env-variable `PYTORCH_TEST_MPS_MEM_LEAK_CHECK=1` when running test_mps.py (used CUDA code as reference).
- Added support for the following new python interfaces in the MPS module (see the usage sketch after this list):
`torch.mps.[empty_cache(), set_per_process_memory_fraction(), current_allocated_memory(), driver_allocated_memory()]`
- Renamed `_is_mps_on_macos_13_or_newer()` to `_mps_is_on_macos_13_or_newer()`, and `_is_mps_available()` to `_mps_is_available()` to be consistent in naming with prefix `_mps`.
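A hedged usage sketch of the new memory interfaces listed above (requires an MPS-enabled macOS build):
```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    print(torch.mps.current_allocated_memory())   # bytes currently owned by tensors
    print(torch.mps.driver_allocated_memory())    # total driver-side allocation
    del x
    torch.mps.empty_cache()                       # release cached blocks back to the system
```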
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94646
Approved by: https://github.com/malfet
Summary:
We have tests testing package level migration correctness for torch AO migration.
After reading the code, I noticed that these tests are not testing anything
additional on top of the function level tests we already have.
An upcoming user warning PR will break this test, and it doesn't seem worth fixing.
As long as the function level tests pass, 100% of user functionality will
be tested. Removing this in a separate PR to keep PRs small.
Test plan:
```
python test/test_quantization.py -k AOMigration
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94422
Approved by: https://github.com/jcaip
Summary:
This test case is dead code. A newer version of this code
exists in `test/quantization/ao_migration/test_quantization.py`. I
think this class must have been mistakenly left during a refactor.
Deleting it.
Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94420
Approved by: https://github.com/jerryzh168
Hopefully fixes #89205.
This is another version of #90847 where it was reverted because it increases the compile-time significantly.
From my discussion with @ngimel in https://github.com/pytorch/pytorch/pull/93153#issuecomment-1409051528, it seems the option of jiterator would be very tricky if not impossible.
So what I did was to optimize the compile time on my computer.
To optimize the build time, I first compiled PyTorch as a whole, then changed only the `LogcumsumexpKernel.cu` file to see how it affects the compile time.
Here are my results for the compilation time of only the `LogcumsumexpKernel.cu` file on my computer:
- Original version (without any complex implementations): 56s (about 1 minute)
- The previous PR (#90847): 13m 57s (about 14 minutes)
- This PR: 3m 35s (about 3.5 minutes)
If the previous PR increased the build time by 30 minutes on PyTorch's build machines, then this PR reduces the increase to about 6 minutes. Hopefully this is an acceptable level of build-time increase.
What I did was (sorted by how significant it reduces the build time from the most significant one):
- Substituting `log(x)` with `log1p(x - 1)`. This is applied in the infinite case, so we don't really care about precision.
- Implementing complex exponential manually
tag: @malfet, @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94310
Approved by: https://github.com/Skylion007, https://github.com/malfet
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS (see the usage sketch below):
`torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`
Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
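A hedged usage sketch of the RNG and synchronization interfaces listed above (requires an MPS-enabled macOS build):
```python
import torch

if torch.backends.mps.is_available():
    torch.manual_seed(0)                  # global seeding now also seeds the MPS generator
    state = torch.mps.get_rng_state()
    a = torch.randn(3, device="mps")
    torch.mps.set_rng_state(state)
    b = torch.randn(3, device="mps")
    torch.mps.synchronize()               # wait for the MPS stream to finish
    print(torch.equal(a.cpu(), b.cpu()))  # True: same RNG state reproduces the values
```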
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
This is another try (the first was https://github.com/pytorch/pytorch/pull/94172) to fix the warning message when running the inductor CPU path:
```
l. Known situations this can occur are inference mode only compilation involving resize_ or prims (!schema.hasAnyAliasInfo() INTERNAL ASSERT FAILED); if your situation looks different please file a bug to PyTorch.
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 1377, in aot_wrapper_dedupe
fw_metadata, _out = run_functionalized_fw_and_collect_metadata(flat_fn)(
File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 578, in inner
flat_f_outs = f(*flat_f_args)
File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 2455, in functional_call
out = Interpreter(mod).run(*args[params_len:], **kwargs)
File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 136, in run
self.env[node] = self.run_node(node)
File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 177, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 294, in call_module
return submod(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_inductor/mkldnn.py", line 344, in forward
return self._conv_forward(input, other, self.weight, self.bias)
File "/home/xiaobing/pytorch-offical/torch/_inductor/mkldnn.py", line 327, in _conv_forward
return torch.ops.mkldnn._convolution_pointwise_(
File "/home/xiaobing/pytorch-offical/torch/_ops.py", line 499, in __call__
return self._op(*args, **kwargs or {})
File "/home/xiaobing/pytorch-offical/torch/_inductor/overrides.py", line 38, in __torch_function__
return func(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_ops.py", line 499, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: !schema.hasAnyAliasInfo() INTERNAL ASSERT FAILED at "/home/xiaobing/pytorch-offical/aten/src/ATen/FunctionalizeFallbackKernel.cpp":32, please report a bug to PyTorch. mutating and aliasing ops should all have codegen'd kernels
While executing %self_layer2_0_downsample_0 : [#users=2] = call_module[target=self_layer2_0_downsample_0](args = (%self_layer1_1_conv2, %self_layer2_0_conv2), kwargs = {})
Original traceback:
File "/home/xiaobing/vision/torchvision/models/resnet.py", line 100, in forward
identity = self.downsample(x)
| File "/home/xiaobing/vision/torchvision/models/resnet.py", line 274, in _forward_impl
x = self.layer2(x)
| File "/home/xiaobing/vision/torchvision/models/resnet.py", line 285, in forward
return self._forward_impl(x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94581
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes #87219
Implements a new ``repeat_interleave`` function in ``aten/src/ATen/native/mps/operations/Repeat.mm``
Adds it to ``aten/src/ATen/native/native_functions.yaml``
Adds a new test ``test_repeat_interleave`` to ``test/test_mps.py``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88649
Approved by: https://github.com/kulinseth
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
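For example, the kind of rewrite applied here:
```python
b = [1, 2, 2, 3]

# before: an unnecessary generator expression passed to set()
s1 = set(a for a in b)

# after: a set comprehension
s2 = {a for a in b}

assert s1 == s2 == {1, 2, 3}
```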
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
This is my commandeer of https://github.com/pytorch/pytorch/pull/82154 with a couple extra fixes.
The high level idea is that when we start profiling we see python frames which are currently executing, but we don't know what system TID created them. So instead we defer the TID assignment, and then during post processing we peer into the future and use the system TID *of the next* call on that Python TID.
As an aside, it turns out that CPython does some bookkeeping (ee821dcd39/Include/cpython/pystate.h (L159-L165), thanks @dzhulgakov for the pointer), but you'd have to do some extra work at runtime to know how to map their TID to ours so for now I'm going to stick to what I can glean from post processing alone.
As we start observing more threads it becomes more important to be principled about how we start up and shut down. (Since threads may die while the profiler is running.) #82154 had various troubles with segfaults that wound up being related to accessing Python thread pointers which were no longer alive. I've tweaked the startup and shutdown interaction with the CPython interpreter and it should be safer now.
Differential Revision: [D42336292](https://our.internmc.facebook.com/intern/diff/D42336292/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91684
Approved by: https://github.com/chaekit
Summary:
Update XNNPACK to 51a987591a6fc9f0fc0707077f53d763ac132cbf (51a987591a)
Update the corresponding CMake and BUCK rules, as well as the generate_wrapper.py for the new version.
XNNPACK has changed a lot since our pinned version, so we need to update it now for several reasons. The upstream community has refactored the code, including API changes; the scale of the changes is visible in their CMakeLists.txt. To keep following upstream, which is crucial for our future development, we need to update XNNPACK at this time. Many other projects also rely on newer versions of XNNPACK. Since there are API changes in XNNPACK, we update our usages accordingly. We also update the target build files and generate_wrapper.py to make this process more automatic. The original target files were missing some files, so we add them to the buck2 build files so that XNNPACK builds and tests successfully.
Test Plan:
buck2 build //xplat/third-party/XNNPACK:operators
buck2 build //xplat/third-party/XNNPACK:XNNPACK
buck2 test fbcode//caffe2/test:xnnpack_integration
Reviewed By: digantdesai
Differential Revision: D43092938
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94330
Approved by: https://github.com/digantdesai, https://github.com/albanD
Currently we don't enable fusion of mutation ops in any case (we introduce a `StarDep` to prevent fusion with any upstream readers, to ensure the kernel mutating the buffer executes after them).
This results in cases like [this](https://gist.github.com/mlazos/3dcfd416033b3459ffea43cb91c117c9) where even though all of the other readers have been fused into a single kernel, the `copy_` is left by itself.
This PR introduces `WeakDep` and a pass after each fusion that checks whether, after fusion, there are other dependencies on the upstream fused node which already guarantee that this kernel runs after the prior readers; if there are, the `WeakDep` is pruned and the kernel performing the mutation can be fused with the upstream kernel. This will allow Inductor to fuse epilogue `copy_`s introduced by functionalization on inference graphs.
[before code](https://gist.github.com/mlazos/3369a11dfd1b5cf5bb255313b710ef5b)
[after code](https://gist.github.com/mlazos/1005d8aeeba56e3a3e1b70cd77773c53)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94110
Approved by: https://github.com/jansel
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.
This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
Fixes #87374
@kulinseth and @albanD This makes the MPSAllocator call the MPSAllocatorCallbacks when getting a free buffer and the first allocation attempt fails. Users can register callbacks that might free a few buffers, and the allocation will then be retried.
The reason why we need the `recursive_mutex` is that, since callbacks are supposed to free memory, they will eventually call free_buffer(), which locks the same `mutex` that's used for allocation. This approach is similar to what's used with the `FreeMemoryCallback` in the `CUDACachingAllocator`.
This PR tries to be as minimal as possible, but there could be some additional improvements/cleanups, like:
- In current main, there's no way callbacks can be called, so we could probably rename the callback registry to something that reflects the same naming as in the CudaAllocator:
996cc1c0d0/c10/cuda/CUDACachingAllocator.h (L14-L24)
- Review the EventTypes here:
996cc1c0d0/aten/src/ATen/mps/MPSAllocator.h (L18-L23)
- And IMHO a nice improvement would be if callbacks could be aware of AllocParams, so they can decide to be more or less aggressive depending on how much memory is requested. So I'd pass AllocParams in the signature of the executeCallback instance:
996cc1c0d0/aten/src/ATen/mps/MPSAllocator.h (L25)
Let me know if you think we could sneak those changes into this PR or if it's better to propose them in other smaller PR's.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94133
Approved by: https://github.com/kulinseth, https://github.com/razarmehr, https://github.com/albanD
The other `Autograd[Backend]` keys all have fallthrough kernels registered to them, but `AutogradMeta` was missing the fallthrough kernel.
This is a problem for custom ops that don't have autograd support, if you try to run them with meta tensors. If you have a custom op, and register a CPU and a Meta kernel, then:
(1) if you run the op with cpu tensors, it will dispatch straight to the CPU kernel (as expected)
(2) if you run the op with meta tensors, you will error - because we don't have a fallthrough registered to the AutogradMeta key, we will try to dispatch to the AutogradMeta key and error, since the op author hasn't provided an autograd implementation.
Here's a repro that I confirmed now works:
```
import torch
from torch._dispatch.python import enable_python_dispatcher
from torch._subclasses.fake_tensor import FakeTensorMode
lib = torch.library.Library("test", "DEF")
impl_cpu = torch.library.Library("test", "IMPL", "CPU")
impl_meta = torch.library.Library("test", "IMPL", "Meta")
def foo_impl(x):
    return x + 1

lib.define("foo(Tensor a) -> Tensor")
impl_meta.impl("foo", foo_impl)
impl_cpu.impl("foo", foo_impl)

with enable_python_dispatcher():
    a = torch.ones(2, device='meta')
    print("@@@@@")
    b = torch.ops.test.foo.default(a)
    print(b)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94603
Approved by: https://github.com/ezyang, https://github.com/albanD
Fixes #88951
The output shape of upsample is computed as `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say, a negative number as in #88951), which makes `(double)(idim * scale)` fall out of the range of `i64`, the cast is undefined behaviour.
To fix it, we just check if `(double)(idim * scale)` can fit into `i64`.
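Conceptually, the added check looks like this (a Python sketch of the C++ change, with illustrative names):
```python
def checked_output_size(idim: int, scale: float) -> int:
    out = float(idim) * scale
    # reject results that cannot be represented as a signed 64-bit integer
    if not (-2**63 <= out < 2**63):
        raise ValueError(f"computed output size {out} is out of range for int64")
    return int(out)
```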
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
Summary:
- Remove redundant bool casts from scatter/gather
- Make the workarounds for scatter/gather (for bool/uint8 data types) OS specific - use them only in macOS Monterey, ignore them starting with macOS Ventura
- Make all tensors ranked in scatter
Fixes following tests:
```
test_output_match_slice_scatter_cpu_bool
test_output_match_select_scatter_cpu_bool
test_output_match_diagonal_scatter_cpu_bool
test_output_match_repeat_cpu_bool
test_output_match_rot90_cpu_bool
etc..
```
Still failing on macOS Monterey (needs additional investigation):
```
test_output_match_scatter_cpu_bool
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94464
Approved by: https://github.com/kulinseth
* CI Test environment to install onnx and onnx-script.
* Add symbolic function for `bitwise_or`, `convert_element_type` and `masked_fill_`.
* Update symbolic function for `slice` and `arange`.
* Update .pyi signature for `_jit_pass_onnx_graph_shape_type_inference`.
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94564
Approved by: https://github.com/abock
- Also fix FP16 correctness issues in several other ops by lowering their FP16 precision in the new list `FP16_LOW_PRECISION_LIST`.
- Add atol/rtol to the `AssertEqual()` of Gradient tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94567
Approved by: https://github.com/kulinseth
# Summary
- Adds type hinting support for SDPA
- Updates the documentation adding warnings and notes on the context manager
- Adds scaled_dot_product_attention to the non-linear activation function section of nn.functional docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94008
Approved by: https://github.com/cpuhrsch
To match nodes within the graph, the matcher currently flattens the arguments and compares each argument against each other. However, if it believes that a list input contains all literals, it will not flatten the list and will instead compare the list directly against each other. It determines if a list is a literal by checking if the first element is a node. However this doesn't work in some cases (like the test cases I added).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94375
Approved by: https://github.com/SherlockNoMad
Fixes #88940
According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html):
1. "The returned tensor has the same number of dimensions as the original tensor (`input`). "
2. "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor."
These two conditions cannot be satisfied at the same time if the `input` is a scalar && `index` has multiple values: because a scalar at most holds one element (according to property 1, the output is a scalar), it is impossible to satisfy "The `dim`th dimension has the same size as the length of `index`" when `index` has multiple values.
However, currently, if we do so we either get:
1. Buffer overflow with ASAN;
2. Or (w/o ASAN) silently returns outputs that are not consistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`).
As a result, we should explicitly reject such cases.
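For illustration (the exact error message depends on the patch; before this change the call silently returned `x`):
```python
import torch

x = torch.tensor(5.0)            # 0-dim (scalar) tensor
idx = torch.tensor([0, 0, 0])    # multi-element index

try:
    out = x.index_select(0, idx)
    print(out)
except RuntimeError as e:
    print("rejected:", e)
```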
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347
Approved by: https://github.com/malfet
### Motivation of this PR
This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300
**GAS** is the major step for Message Passing, the behavior of **GAS** can be classified into 2 kinds depending on the storage type of `EdgeIndex` which records the connections of nodes:
* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`
The reduce type can be chosen from: "sum", "mean", "max", "min".
Extend `torch.sparse.mm` with a `reduce` argument; it maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records output indices in the non zero elements if the reduce type is "max" or "min", this is only useful for training. So for inference, it will not be calculated.
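A hedged usage sketch of the extended `torch.sparse.mm` (the `reduce` argument and accepted strings follow the text above; the input is assumed to be a sparse CSR matrix):
```python
import torch

a = torch.tensor([[0., 2., 0.],
                  [3., 0., 4.]]).to_sparse_csr()
b = torch.randn(3, 5)

out = torch.sparse.mm(a, b, reduce="mean")
print(out.shape)  # torch.Size([2, 5])
```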
### Performance
Benchmark on GCN for obgn-products on Xeon single socket, the workload is improved by `4.3x` with this patch.
The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.
#### before:
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
torch_sparse::spmm_sum 97.09% 56.086s 97.09% 56.088s 6.232s 9
aten::linear 0.00% 85.000us 1.38% 795.485ms 88.387ms 9
aten::matmul 0.00% 57.000us 1.38% 795.260ms 88.362ms 9
aten::mm 1.38% 795.201ms 1.38% 795.203ms 88.356ms 9
aten::relu 0.00% 50.000us 0.76% 440.434ms 73.406ms 6
aten::clamp_min 0.76% 440.384ms 0.76% 440.384ms 73.397ms 6
aten::add_ 0.57% 327.801ms 0.57% 327.801ms 36.422ms 9
aten::log_softmax 0.00% 23.000us 0.10% 55.503ms 18.501ms 3
```
#### after
```
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
----------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::spmm_sum 87.35% 11.826s 87.36% 11.827s 1.314s 9
aten::linear 0.00% 92.000us 5.87% 794.451ms 88.272ms 9
aten::matmul 0.00% 62.000us 5.87% 794.208ms 88.245ms 9
aten::mm 5.87% 794.143ms 5.87% 794.146ms 88.238ms 9
aten::relu 0.00% 53.000us 3.35% 452.977ms 75.496ms 6
aten::clamp_min 3.35% 452.924ms 3.35% 452.924ms 75.487ms 6
aten::add_ 2.58% 348.663ms 2.58% 348.663ms 38.740ms 9
aten::argmax 0.42% 57.473ms 0.42% 57.475ms 14.369ms 4
aten::log_softmax 0.00% 22.000us 0.39% 52.605ms 17.535ms 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
`combine_t` is the type used to represent the number of elements seen so far as
a floating point value (acc.nf). It is always used in calculations with other
values of type `acc_scalar_t`, so there is no performance gain from making this a
separate template argument. Furthermore, when calculating the variance on CUDA
it is always set to `float` which means values are unnecessarily truncated
before being immediately promoted to `double`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94522
Approved by: https://github.com/ngimel
Per @ezyang's advice, added a magic sym_int method. This works for the `1.0 * s0` optimization, but can't evaluate `a > 0` for some args, and still misses some optimizations that the model rewrite achieves, so swin still fails
(rewrite replaces `B = int(windows.shape[0] / (H * W / window_size / window_size))` with `B = (windows.shape[0] // int(H * W / window_size / window_size))` and model passes)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94365
Approved by: https://github.com/ezyang
Fixes batchnorm forward/backward pass and layer_norm:
Batchnorm Forward pass:
```
- fix batch_norm_mps_out key
- return 1/sqrt(var+epsilon) instead of var
- return empty tensor for mean and var if train is not enabled
- remove native_batch_norm from block list
```
Batchnorm Backward pass:
```
- add revert calculation for save_var used in backward path
- add backward test for native_batch_norm and _native_batch_norm_legit
```
Layer norm:
```
- remove the duplicate calculation from layer_norm_mps
- enable native_layer_norm backward test
- raise atol rtol for native_layer_norm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94351
Approved by: https://github.com/razarmehr
Hi!
I've been fuzzing different pytorch modules, and found a few crashes.
Specifically, I'm talking about `schema_type_parser.cpp` and `irparser.cpp`. Inside these files, different standard conversion functions are used (such as `stoll`, `stoi`, `stod`, `stoull`). However, default `std` exceptions, such as `std::out_of_range`, `std::invalid_argument`, are not handled.
Some of the crash-files:
1. [crash-493db74c3426e79b2bf0ffa75bb924503cb9acdc.zip](https://github.com/pytorch/pytorch/files/10237616/crash-493db74c3426e79b2bf0ffa75bb924503cb9acdc.zip) - crash source: schema_type_parser.cpp:272
2. [crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5.zip](https://github.com/pytorch/pytorch/files/10237618/crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5.zip) - crash source: schema_type_parser.cpp:240
3. [crash-0157bca5c41bffe112aa01f3b0f2099ca4bcc62f.zip](https://github.com/pytorch/pytorch/files/10307970/crash-0157bca5c41bffe112aa01f3b0f2099ca4bcc62f.zip) - crash source: schema_type_parser.cpp:179
4. [crash-430da923e56adb9569362efa7fa779921371b710.zip](https://github.com/pytorch/pytorch/files/10307972/crash-430da923e56adb9569362efa7fa779921371b710.zip) - crash source: schema_type_parser.cpp:196
The provided patch adds exception handlers for `std::invalid_argument` and `std::out_of_range`, to rethrow these exceptions with `ErrorReport`.
### How to reproduce
1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/blob/master/projects/pytorch/Dockerfile)
2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`
3. Copy crash file to the current directory
4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``
5. And execute the binary: `/irparser_fuzz /homedir/crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5`
After execution completes you will see this error message:
```txt
terminate called after throwing an instance of 'std::out_of_range'
what(): stoll
```
And this stacktrace:
```asan
==9626== ERROR: libFuzzer: deadly signal
#0 0x5b4cf1 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3
#1 0x529627 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5
#2 0x50f833 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3
#3 0x7ffff7c3741f (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)
#4 0x7ffff7a5700a in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300a)
#5 0x7ffff7a36858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858)
#6 0x7ffff7e74910 (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)
#7 0x7ffff7e8038b (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)
#8 0x7ffff7e803f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)
#9 0x7ffff7e806a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)
#10 0x7ffff7e7737d in std::__throw_out_of_range(char const*) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa137d)
#11 0xbd0579 in long long __gnu_cxx::__stoa<long long, long long, char, int>(long long (*)(char const*, char**, int), char const*, char const*, unsigned long*, int) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/string_conversions.h:86:2
#12 0xc10f9c in std::__cxx11::stoll(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long*, int) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:6572:12
#13 0xc10f9c in torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const::'lambda'()::operator()() const /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:240:25
#14 0xc10f9c in void c10::function_ref<void ()>::callback_fn<torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const::'lambda'()>(long) /pytorch_fuzz/c10/util/FunctionRef.h:43:12
#15 0xbfbb27 in torch::jit::SchemaTypeParser::parseList(int, int, int, c10::function_ref<void ()>) /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:424:7
#16 0xc0ef24 in torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:236:9
#17 0xc0ef24 in void c10::function_ref<void ()>::callback_fn<torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2>(long) /pytorch_fuzz/c10/util/FunctionRef.h:43:12
#18 0xbfbb27 in torch::jit::SchemaTypeParser::parseList(int, int, int, c10::function_ref<void ()>) /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:424:7
#19 0xbff590 in torch::jit::SchemaTypeParser::parseRefinedTensor() /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:209:3
#20 0xc02992 in torch::jit::SchemaTypeParser::parseType() /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:362:13
#21 0x9445642 in torch::jit::IRParser::parseVarWithType(bool) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:111:35
#22 0x944ff4c in torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0::operator()() const /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:138:21
#23 0x944ff4c in void std::__invoke_impl<void, torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0&>(std::__invoke_other, torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
#24 0x94463a7 in torch::jit::IRParser::parseList(int, int, int, std::function<void ()> const&) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:498:7
#25 0x94460a5 in torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:137:3
#26 0x944c1ce in torch::jit::IRParser::parseOperator(torch::jit::Block*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:384:3
#27 0x944bf56 in torch::jit::IRParser::parseOperatorsList(torch::jit::Block*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:362:5
#28 0x9444f5f in torch::jit::IRParser::parse() /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:482:3
#29 0x94448df in torch::jit::parseIR(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Graph*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::jit::Value*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, torch::jit::Value*> > >&) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:94:5
#30 0x944526e in torch::jit::parseIR(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Graph*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:99:3
#31 0x5e3ebd in LLVMFuzzerTestOneInput /irparser_fuzz.cc:43:5
#32 0x510d61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#33 0x4fac7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#34 0x5009cb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#35 0x529f62 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#36 0x7ffff7a38082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
#37 0x4f559d in _start (/irparser_fuzz+0x4f559d)
```
Following these steps with the remaining crashes will give you almost the same results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94295
Approved by: https://github.com/davidberard98
Summary:
This PR tries to decompose the operators in the torch.ops.quantized_decomposed namespace into more
primitive aten operators. This would free us from maintaining the semantics of the quantize/dequantize
operators, which can be expressed more precisely in terms of the underlying aten operators.
Note: this PR just adds them to the decomposition table; we haven't enabled this by default yet.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_q_dq_decomposition
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93312
Approved by: https://github.com/vkuzo, https://github.com/SherlockNoMad
This is the second time I've spotted this error on the new Windows non-ephemeral runners, so let's get it fixed.
The error https://github.com/pytorch/pytorch/actions/runs/4130018165/jobs/7136942722 was during 7z-ing the usage log artifact on the runners:
```
WARNING: The process cannot access the file because it is being used by another process.
usage_log.txt
```
The locking process is probably the monitoring script. This looks very similar to the issue on MacOS pet runners in which the monitoring script is sometimes not killed.
I could try to kill the process to unlock the file. But then not being able to upload the usage log here is arguably ok too. So I think it would be easier to just ignore the locked file and move on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94483
Approved by: https://github.com/clee2000
Calculate the nonzero count directly in the nonzero op.
Additionally, synchronize before entering the nonzero op to make sure all previous operations have finished (the output shape is allocated based on the count_nonzero count).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94442
Approved by: https://github.com/kulinseth
Refcounting is hard. (Citation needed.) https://github.com/pytorch/pytorch/pull/81242 introduced a corner case where we would over incref when breaking out due to max (128) depth. https://github.com/pytorch/pytorch/pull/85847 ostensibly fixed a segfault, but in actuality was over incref-ing because PyEval_GetFrame returns a borrowed reference while `PyFrame_GetBack` returns a strong reference.
Instead of squinting really hard at the loops, it's much better to use the RAII wrapper and do the right thing by default.
I noticed the over incref issue because of a memory leak where Tensors captured by the closure of a function would be kept alive by zombie frames.
Differential Revision: [D42184394](https://our.internmc.facebook.com/intern/diff/D42184394/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91646
Approved by: https://github.com/albanD
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.
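For context, a minimal sketch of the relevant switches (the exact benchmark-runner change may differ):
```python
import torch

torch.backends.cudnn.deterministic = True  # not sufficient on its own here
torch.backends.cudnn.enabled = False       # turn cuDNN off completely
```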
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
Summary:
Previously, prepare_fx returned an ObservedGraphModule and convert_fx returned a QuantizedGraphModule;
this was done to preserve the attributes, since torch.fx.GraphModule did not preserve them. After https://github.com/pytorch/pytorch/pull/92062
we are preserving `model.meta`, so we can now store the attributes in model.meta to preserve them.
With this, we don't need to create a new type of GraphModule in these functions and can use GraphModule directly. This
is useful for quantization in the pytorch 2.0 flow: if other transformations are using GraphModule as well, the quantization passes will be composable with them.
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
Imported from OSS
Differential Revision: D42979722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94412
Approved by: https://github.com/vkuzo
Summary:
https://github.com/pytorch/pytorch/pull/94170 broke some Meta-only tests because it broke the following syntax:
```
import torch.nn.intrinsic
_ = torch.nn.intrinsic.quantized.dynamic.*
```
This broke with the name change because the `ao` folder is currently doing lazy import loading, but the original folders are not.
For now, just unbreak the folders needed for the tests to pass. We will follow up with ensuring this doesn't break for other folders in a future PR.
Test plan:
```
python test/test_quantization.py -k AOMigrationNNIntrinsic.test_modules_no_import_nn_intrinsic_quantized_dynamic
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94458
Approved by: https://github.com/jerryzh168
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.
We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
- "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead
Test Plan: Added test case in caffe2/test/test_cuda.py
Differential Revision: D40810674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
Summary:
There are a few races/permission errors in file creation; fixing them.
OSS:
1. caffe2/torch/_dynamo/utils.py, get_debug_dir: multiple processes may conflict on it even though it uses a microsecond timestamp. Add the pid to it.
2. caffe2/torch/_dynamo/config.py: it may not be a correct assumption that we have write permission to the cwd.
Test Plan: sandcastle
Differential Revision: D42905908
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93407
Approved by: https://github.com/soumith, https://github.com/mlazos
Summary: Need to re-register the underscored function in order to have the op present in predictor. This is because older models have been exported with the underscored version.
Test Plan: See if predictor tests pass?
Reviewed By: cpuhrsch
Differential Revision: D43138338
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94452
Approved by: https://github.com/cpuhrsch
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
def __init__(...):
if option_string.startswith('--'):
option_string = '--no-' + option_string[2:]
_option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift key, unlike `-`.
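A hedged sketch of how both spellings can be kept for backward compatibility (the argument name is illustrative):
```python
import argparse

parser = argparse.ArgumentParser()
# Register the dashed form as primary and keep the underscored spelling as an alias.
parser.add_argument("--command-arg-name", "--command_arg_name", dest="command_arg_name")

args = parser.parse_args(["--command_arg_name", "value"])  # old spelling still accepted
print(args.command_arg_name)  # prints "value"
```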
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
`dsa_add_new_assertion_failure` is currently causing duplicate definition issues. Possible solutions:
1. Put the device code in a .cu file - requires device linking, which would be very painful to set up.
2. Inline the code - could cause bloat, especially since a function might include many DSAs.
3. Anonymous namespace - balances the above two. Putting the code in a .cu file would ensure that there's a single copy of the function, but it's hard to set up. Inlining the code would cause bloat. An anonymous namespace is easy to set up and produces a single copy of the function per translation unit, which allows the function to be called many times without bloat.
Differential Revision: D42998295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94064
Approved by: https://github.com/ezyang
- fix num_output_dims calculation
- fix median_out_mps key
- cast tensor sent to sortWithTensor and argSortWithTensor
- note down same issue for unique
- unblock median from blocklist
- adding test_median_int16 test
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94489
Approved by: https://github.com/razarmehr
Changes:
1. `typing_extensions -> typing-extensions` in dependencies. Use a dash rather than an underscore to fit the [PEP 503: Normalized Names](https://peps.python.org/pep-0503/#normalized-names) convention.
```python
import re
def normalize(name):
return re.sub(r"[-_.]+", "-", name).lower()
```
2. Import `Literal`, `Protocol`, and `Final` from the standard library as of Python 3.8+.
3. Replace `Union[Literal[XXX], Literal[YYY]]` with `Literal[XXX, YYY]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94490
Approved by: https://github.com/ezyang, https://github.com/albanD
Part of fixing #88098
## Context
This is the first of 3 PRs to address issue 88098 (move the label check failure logic from the `check_labels.py` workflow to the `trymerge.py` mergebot). Due to the messy cross-script imports and potential circular dependencies, some refactoring of the scripts is required before the functional PR can be cleanly implemented.
## What Changed
1. Extract label utility functions from the `export_pytorch_labels.py` script into a `label_utils.py` module.
2. Small improvements to naming, interface and test coverage
## Note to Reviewers
This series of PRs is to replace the original PR https://github.com/pytorch/pytorch/pull/92682 to make the changes more modular and easier to review.
* 1st PR: this one
* 2nd PR: https://github.com/Goldspear/pytorch/pull/2
* 3rd PR: https://github.com/Goldspear/pytorch/pull/3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94179
Approved by: https://github.com/ZainRizvi
The inductor/test_torchinductor suite is not running as part of the CI. I have triaged this down to a bug in the arguments supplied in test/run_test.py.
Currently test_inductor runs the test suites as:
`PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include inductor/test_torchinductor --include inductor/test_torchinductor_opinfo --verbose`
which only kicks off the test_torchinductor_opinfo suite.
Example from CI logs: https://github.com/pytorch/pytorch/actions/runs/3926246136/jobs/6711985831#step:10:45089
```
+ PYTORCH_TEST_WITH_INDUCTOR=0
+ python test/run_test.py --include inductor/test_torchinductor --include inductor/test_torchinductor_opinfo --verbose
Ignoring disabled issues: []
/var/lib/jenkins/workspace/test/run_test.py:1193: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
inductor/test_torchinductor_opinfo
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['inductor/test_torchinductor_opinfo']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92833
Approved by: https://github.com/seemethere
When a TorchScript Value holds an optional tensor, `dtype()` or `scalarType()` is not available and raises (by design).
The symbolic `_op_with_optional_float_cast` must check whether the tensor is optional or not before calling the scalar type resolution API. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94427
Approved by: https://github.com/abock, https://github.com/shubhambhokare1
Per @ezyang's advice, added a magic sym_int method. This works for the 1.0 * s0 optimization, but can't evaluate `a>0` for some args, and still misses some optimizations that the model rewrite achieves, so swin still fails
(the rewrite replaces `B = int(windows.shape[0] / (H * W / window_size / window_size))` with `B = (windows.shape[0] // int(H * W / window_size / window_size))`, and then the model passes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94365
Approved by: https://github.com/ezyang
Currently there is a potential conflict for `GLIBCXX_USE_CXX11_ABI` configuration if users don't explicitly set this variable.
In `caffe2/CMakeLists.txt`, if the variable is not set, an `abi checker` will be used to retrieve the ABI configuration from compiler.
https://github.com/pytorch/pytorch/blob/master/caffe2/CMakeLists.txt#L1165-L1183
However, in `torch/csrc/Module.cpp`, if the variable is not set, it will be set to `0`. The conflict happens when the default ABI of the compiler is `1`.
https://github.com/pytorch/pytorch/blob/master/torch/csrc/Module.cpp#L1612
This PR eliminates this uncertainty and potential conflict.
The ABI will be checked and set in `CMakeLists.txt`, and the value will be passed to `caffe2/CMakeLists.txt`. Meanwhile, in case `caffe2/CMakeLists.txt` is directly invoked from a `cmake` command, the original GLIBC check logic is kept in this file.
If users don't explicitly assign a value to `GLIBCXX_USE_CXX11_ABI`, the `abi checker` will be executed and set the value accordingly. If the `abi checker` fails to compile or execute, the value will be set to `0`. If users explicitly assign a value, then the provided value will be used.
Moreover, if `GLIBCXX_USE_CXX11_ABI` is set to `0`, the '-DGLIBCXX_USE_CXX11_ABI=0' flag won't be appended to `CMAKE_CXX_FLAGS`. Thus, whether ABI=0 or ABI=1 is used fully depends on the compiler's default configuration. This could cause an issue where, even though users explicitly set `GLIBCXX_USE_CXX11_ABI` to `0`, the compiler still builds the binaries with ABI=1.
https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L44-L51
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94306
Approved by: https://github.com/malfet
# Summary
Add more checks around shape constraints as well as update the sdp_utils to properly catch different head_dims between qk and v for flash_attention which is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274
Approved by: https://github.com/cpuhrsch
- Fix wrong results in AvgPool2D when `count_include_pad=True`
- Fix issues with adaptive average and max pool2d
- Remove the redundant blocking copies from `AdaptiveMaxPool2d`
- Add `divisor` to cached string key to avoid conflicts
- Add test case when both `ceil_mode` and `count_include_pad` are True (previously failed).
- Clean up redundant code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94348
Approved by: https://github.com/kulinseth
Historically, we worked out `size_hint` on the fly by doing a substitution on the sympy expression with the `var_to_val` mapping. With this change, we also maintain the hint directly on SymNode (in `expr._hint`) and use it in lieu of Sympy substitution when it is available (mostly guards on SymInt, etc.; in idiomatic Inductor code, we typically manipulate Sympy expressions directly and so do not have a way to conveniently maintain hints).
While it's possible this will give us modest performance improvements, this is not the point of this PR; the goal is to make it easier to carefully handle unbacked SymInts, where hints are expected not to be available. You can now easily test if a SymInt is backed or not by checking `symint.node.hint is None`.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94201
Approved by: https://github.com/voznesenskym
Supports the following with dynamic shapes:
```python
for element in tensor:
# do stuff with element
```
Approach follows what's done when `call_range()` is invoked with dynamic shape inputs: guard on tensor size and continue tracing with a real size value from `dyn_dim0_size.evaluate_expr()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94326
Approved by: https://github.com/ezyang
All this time, PyTorch and ONNX have had different strategies for None in outputs, and in internal tests we flatten the torch outputs to see if the rest of them match. However, this doesn't work anymore in scripting after the Optional node was introduced, since some of the Nones are kept.
#83184 forces the script module to keep all Nones from PyTorch, but in ONNX the model only keeps the ones generated with an Optional node and deletes the meaningless Nones.
This PR uses the Optional node to keep those meaningless Nones in the output as well, so when it comes to script module result comparison, PyTorch and ONNX should have the same number of Nones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84789
Approved by: https://github.com/BowenBao
Fix#82589
Why:
1. **full_check** works in the `onnx::checker::check_model` function as it turns on **strict_mode** in `onnx::shape_inference::InferShapes()`, which I think was the intention of this part of the code.
2. **strict_mode** catches failed shape type inference (an invalid ONNX model from the onnx perspective), and ONNXRUNTIME can't run these invalid models, as ONNXRUNTIME actually relies on ONNX shape type inference to optimize the ONNX graph. Why don't we set it to True by default? Some existing users use other platforms, such as caffe2, to run ONNX models, which don't require a valid ONNX model to run.
3. This PR doesn't change the original behavior of `check_onnx_proto`, but adds a warning message for models that can't pass strict shape type inference, saying the models would fail on onnxruntime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83186
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi, https://github.com/jcwchen, https://github.com/BowenBao
Add `collect_ciflow_labels.py` that automatically extracts all labels from workflow files and adds them to pytorch-probot.yml.
The same script can also be used to validate that all tags are referenced in the config.
Add this validation to quickchecks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94368
Approved by: https://github.com/jeanschmidt
**Problem**: For a tensor `x`, you can assign `x.my_attr = 3.14` and then later access it. Dynamo does not support this right now; it errors out with an AttributeError (it was broken in #91840).
**Fix**: This fixes the problem by catching AttributeErrors in dynamo if we try to access an attr that does not exist on a standard torch.Tensor.
**Tests**: Added tests for accessing and setting attributes to make sure dynamo does not error out.
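A minimal repro sketch of the pattern (assuming the post-fix behavior of handling the attribute instead of raising):
```python
import torch
import torch._dynamo as dynamo

x = torch.ones(3)
x.my_attr = 3.14  # attach an ad-hoc attribute to a tensor

def f(t):
    return t * t.my_attr  # previously raised AttributeError during tracing

opt_f = dynamo.optimize("eager")(f)
print(opt_f(x))
```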
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94332
Approved by: https://github.com/yanboliang
While discussing a possible addition of `assert_not_close` to the API (See #90005 later in the stack), it became clear that we should have an intermediate function that returns a bool-ish value that one can assert on. This PR introduces this function as `are_equal` as replacement for `assert_equal`. Interface is the same, but instead of raising in case a comparison failed, we return the `ErrorMeta`'s of all failures and leave it to the caller to handle. Note that this only applies to errors raised during the comparison stage. Everything else, e.g. only setting `atol` *or* `rtol`, will raise just as before.
We decided to keep this private for now unless there is user demand. The largest issue that needs to be solved before this can become public is the return type: if we have something like `torch.testing.are_close` we are targeting two use cases:
1. Using it to branch inside code like `if are_close(...):`
2. Using it to assert closeness inside a test like `assert are_close(...)`. This is the default way to assert something with `pytest`
To do that, the return type has to be bool-ish, i.e. be an instance of `bool` or implement `__bool__`. Plus, `bool(are_close())` needs to be `True` if the inputs are close and `False` otherwise. The current logic of `are_close` satisfies the former, but violates the latter: in case everything is close, we return an empty list, but `bool([]) is False`.
Directly using an instance of `bool` would work for the requirements above, but then we would have no option to add diagnostics to the error. Meaning `assert are_close()` would work, but would be non-descriptive.
Using `Tuple[bool, str]` would work in general, but is quite dangerous and unexpected: since all non-empty tuples evaluate to `True`, this can easily hide bugs if the user is not super careful:
```pycon
>>> close = (False, "error message with diagnostics")
>>> assert close[0]
AssertionError: error message with diagnostics
>>> assert close
```
One possible solution here would be a thin custom object:
```py
class Close:
def __init__(self, flag:bool, msg: str = "") -> None:
self._flag = flag
self._msg = msg
def __bool__(self):
return self._flag
def __str__(self):
return self._msg
```
Now we can do something like
```pycon
close = Close(False, "error message with diagnostics") # coming from are_close
>>> if not close:
... print("It works!")
It works!
>>> assert close
AssertionError
>>> assert close, close # This looks weird, but does its job
AssertionError: error message with diagnostics
```
But this means we introduce another abstraction that the user has to deal with.
To reiterate, we are not going to make `are_close` public until there is user demand, since none of the options above is without flaws.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90004
Approved by: https://github.com/mruberry, https://github.com/malfet
tldr; this should fix some minor perf regressions that were caused by adding more as_strided() calls in aot autograd.
This PR adds a new context manager, `torch.autograd._set_view_replay_enabled()`.
Context: AOT Autograd has special handling for "outputs that alias graph intermediates". E.g. given this function:
```
def f(x):
y = torch.mul(x, 2)
out = y.view(-1)
return out
```
AOT Autograd will do the following:
```
def fn_to_compile(x):
y = torch.mul(x, 2)
out = y.view(-1)
# return the graph intermediate
return y, out
compiled_fn = compile(fn_to_compile)
def wrapper(x):
y, out = compiled_fn(x)
# regenerate the alias of the graph intermediate
return out._view_func(y)
```
What's annoying is that `out._view_func()` will result in a `.as_strided` call, because `out` is an ordinary runtime tensor. This (likely?) caused a perf regression, because when running the backward, our `as_strided_backward()` is slower than our `view_backward()`.
In this PR, I added some TLS for instructing autograd to do view replay instead of as_strided, even when given a normal tensor. I'm definitely interested in thoughts from autograd folks (cc @albanD @soulitzer). A few points that I want to bring up:
(1) One reason that this API seems generally useful to me is because of the case where you `torch.compile()` a function, and you pass in two inputs that alias each other, and mutate one of the inputs. Autograd is forced to add a bunch of as_strided() calls into the graph when this happens, but this would give users an escape hatch for better compiled perf in this situation
(2) To be fair, AOT Autograd probably won't need this TLS in the long term. There's a better (more complicated) solution, where AOT Autograd manually precomputes the view chain off of graph intermediates during tracing, and re-applies them at runtime. This is kind of complicated though and feels lower priority to implement immediately.
(3) Given all of that I made the API private, but lmk what you all think.
This is a followup of https://github.com/pytorch/pytorch/pull/92255.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92588
Approved by: https://github.com/ezyang, https://github.com/albanD
Change the dynamo benchmark timeout from a hard-coded value to a parameter with a default value of 1200ms, because the hard-coded 1200ms timeout led to some single-thread-mode models crashing on the CPU platform. With the parameter, users can specify the timeout freely.
Fixes#94281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
Remove unnecessary collection casts, unnecessary calls to list, tuple, and dict, and simplify calls to the sorted builtin. This should strictly improve speed and readability.
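Illustrative examples of the kinds of simplifications meant here (not the actual diff):
```python
items = {"b": 2, "a": 1}

# Before: redundant list() cast around an iterable
keys = sorted(list(items.keys()))

# After: sorted() accepts any iterable and already returns a new list
keys = sorted(items)

# Before: unnecessary tuple() around a literal
coords = tuple((1, 2, 3))
# After:
coords = (1, 2, 3)
```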
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
Copied the type hints from the other context managers.
Not sure how to add type hints for `clone` since it returns the same class. The `Self` type isn't introduced until Python 3.11 and mypy just recently added support for it. Could also use `"inference_mode"` with quotes to avoid using it before it's declared, or `from __future__ import annotations` to allow its use without quotes. Or we could just skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94223
Approved by: https://github.com/albanD
One of the side effects of this is that it is not properly skipped on 3.11.
As a side note, it was very surprising to find testing-specific code in `torch._dynamo` and not `torch.testing`...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94312
Approved by: https://github.com/ezyang
Summary:
This diff introduced the core components needed for the Vulkan Graph runtime.
* ComputeGraph data structure
* Value data structure
* Copy node
* Add node with option for prepacked weights
Test Plan:
Run the `delegate_experiment` binary.
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_use_gpu_diagnostics=1 :delegate_experimentAppleMac\#macosx-arm64
```
Differential Revision: D42614155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94222
Approved by: https://github.com/salilsdesai
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades more `yield`-in-for loops into `yield from` generator delegation, which is more performant and propagates more information/exceptions from the original generator. This is the modern, recommended way of forwarding generators.
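For reference, the `yield from` modernization looks like this (illustrative):
```python
# Before: manual forwarding drops generator return values and is slower
def forward_old(gen):
    for item in gen:
        yield item

# After: yield from delegates values, exceptions, and the return value
def forward_new(gen):
    yield from gen
```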
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.
Before:
```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```
After:
```
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```
My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#
This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:
```
diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
def __torch_dispatch__(self, func, types, args=(), kwargs=None):
kwargs = kwargs if kwargs else {}
+ with no_dispatch():
+ if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+ return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
if func == torch.ops.prim.device.default:
assert len(args) == 1 and isinstance(args[0], FakeTensor)
if args[0].fake_mode.in_kernel_invocation:
```
I am still leaving about 5s of trace time improvement on the table (3s of which is attributable to not yet handling relu.)
The implementation here is based off of https://github.com/pytorch/pytorch/pull/93118/ but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:
* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path if (1) at least one input tensor has a shape that is exactly the output size, and (2) all the tensors are contiguous (or if all the tensors are channels last).
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.
Some evidence that this heuristic is correct is here: https://gist.github.com/ezyang/b22fa7b72b7349137211d8dc7041f758 I exhaustively test all dim=3 tensors with sizes in [1, 2] and show that we get the same significant strides between PrimTorch and the new algorithm. There ARE differences between this algorithm and PrimTorch, but this algorithm agrees with TensorIterator where PrimTorch is wrong (sample case: size=(1, 1, 2), stride=(1, 1, 1), stride=(1, 1, 1)).
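A hedged sketch of the short-circuit condition described above (illustrative only, not the code added in this PR):
```python
import torch

def binary_fast_path_ok(tensors, out_shape):
    # (1) at least one input already has exactly the broadcasted output shape
    if not any(tuple(t.shape) == tuple(out_shape) for t in tensors):
        return False
    # (2) all inputs are contiguous, or all are channels-last
    all_contig = all(t.is_contiguous() for t in tensors)
    all_cl = all(t.is_contiguous(memory_format=torch.channels_last) for t in tensors)
    return all_contig or all_cl
```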
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94047
Approved by: https://github.com/eellison
# Summary
- Adds a large parameter sweep for testing the various configs a user can call sdpa with and compares the deviation of the fused kernels vs the eager math fallback to test for correctness.
- Sm86 + head_dim==128 is throwing an IMA for memory efficient attention. We add a filter for use_mem_efficient_attention(). This has since been fixed in the upstream Xformers version but will likely not make it for branch cut.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94009
Approved by: https://github.com/cpuhrsch
Skip gather/blit calls in case of strided output - this prevents:
- allocating additional memory for the output
- additional transpose for both the input and output
Fixes:
```
x = torch.rand((256,10), device='mps')
x = x.permute(1,0)
x.exp()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94260
Approved by: https://github.com/razarmehr
Fixes TestConsistency masked_fill for bool data type.
Casting a value > 1 to MPSDataTypeBool will result in 0 instead of 1. This change manually casts the scalar to a value of 0 or 1 when casting a non-boolean tensor to a boolean tensor:
```
(inputDataType == MPSDataTypeBool) ? !!value.to<double>() : value.to<double>()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94263
Approved by: https://github.com/razarmehr
There are cases when the arrayViewTensor API cannot be used to solve the view operations, such as when a view dimension is bigger than the base dimension of the tensor, e.g:
```
base shape: [1, 768, 512, 2] // we cannot slice the base shape in any way to result in first dimension `2`
view shape: [2, 384, 512, 1]
```
In such cases, we need to fall back on the gather code (which detects that this is a slice followed by a reshape) to solve this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94278
Approved by: https://github.com/razarmehr
## Problem history
There seems to have always been a bug in `_vec_log_softmax_lastdim`.
In particular, there were two issues with it -
#### Bug 1
Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`:
`CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();`
It was `256` for float32, bfloat16, and float16.
When AVX512 support was added, `CHUNK_SIZE` became `512`.
The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it.
#### Bug 2
`grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)`
So `grain_size` was usually 0, as it was `8 / (dim_size)`; it was therefore always replaced by `CHUNK_SIZE`, viz. 256.
Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases.
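The arithmetic behind that (assuming `at::internal::GRAIN_SIZE` is 32768, and using the shapes from the example below):
```python
GRAIN_SIZE = 32768   # assumed value of at::internal::GRAIN_SIZE
CHUNK_SIZE = 256     # AVX2, float32
dim_size = 23258     # last-dim size from the Transformers example below

print(GRAIN_SIZE // (16 * CHUNK_SIZE))             # 8, i.e. grain_size == 8 // dim_size
print(GRAIN_SIZE // (16 * dim_size * CHUNK_SIZE))  # 0 -> falls back to CHUNK_SIZE (256)
```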
#### Problem caused by bugs
With `outer_size` of say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`!
When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_dim` was 700.
In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`.
AVX512 thus appeared to be considerably slower, cloaking the actual issue that even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads.
## Solution
Distribute work more efficiently, which would result in higher performance for both AVX2 & AVX512 than now,
and fixes the regression observed with AVX512 (AVX512 kernel would now be faster than its AVX2 counterpart).
## Benchmarks
##### Machine-config:
Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake)
One socket of 26 physical cores was used.
Intel OpenMP & tcmalloc were preloaded.
Example of a command to run benchmark:
`ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu`
Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new)
-- | -- | -- | --
LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x
LogSoftmax_N1024_seq_len23258_dim1_cpu AVX512 | 18292.928 | 2586.550| 7.07x
LogSoftmax_N700_seq_len23258_dim1_cpu AVX2 | 9611.902 | 1762.833 | 5.452x
LogSoftmax_N700_seq_len23258_dim1_cpu AVX512 | 12168.371 | 1717.824 | 7.08x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano
As found in #92709, thanks to @ngimel and @jansel, currently `torch.Tensor.fn` points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is due to https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build `TorchVariable` of `torch.Tensor.fn` pointing to `torch.ops.aten.fn`.
This issue propagates to `torch.Tensor.fn` causing graph break with `nopython=True`.
```python
import torch
import torch._dynamo as dynamo
#op = torch.ops.aten.abs_ # no graph break
op = torch.Tensor.abs_ # graph break
args = torch.empty(10)
def foo(args):
return op(args)
opt_foo = dynamo.optimize("inductor", nopython=True)(foo)
y_ = opt_foo(args)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243
Approved by: https://github.com/jansel
Move `ShardingFilterIterDataPipe` into a dedicated file.
Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding datapipes, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute and needs first-class-citizen treatment in RS (compared with other "user-level" datapipes that are mostly composable `Callable[[Iterable], Iterable]`). This way, we don't need to rely on whether `is_shardable` and `apply_sharding` are present on the DataPipe in `graph_settings.py`. But open to other discussions.
Open question: Should
[ShardingRoundRobinDispatcherIterDataPipe](01fc762003/torchdata/datapipes/iter/util/sharding.py (L16-L17)) also be considered a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints that it is too expensive to replicate and so requires round-robin data exchange/dispatch).
Differential Revision: D43014692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94095
Approved by: https://github.com/ejguan, https://github.com/NivekT
Summary:
There are three things that happen in the current prepare code:
(1) users express their intention of how they want the model to be quantized with QConfigMapping, and we translate that to
node.meta["target_dtype_info"]
(2) we validate the setting against BackendConfig
(3) we insert observers based on the validated node.meta["target_dtype_info"]
Previously (2) and (3) were mixed together. This PR tries to move (2) closer to (1), with one edge case left; this refactor
moves us closer to our target design for quantization in the pytorch 2.0 export path.
this is a follow up PR for https://github.com/pytorch/pytorch/pull/92641
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94011
Approved by: https://github.com/vkuzo
When FSDP is used with other parallelism (e.g., TorchRec), some parameters that are not managed by FSDP may not reside on all the ranks (TorchRec is model parallelism). When `use_orig_params=True`, FSDP will synchronize the FQNs among ranks. As a result, a rank may get FQNs that the rank does not actually own. If an FQN belongs to a TorchRec-managed parameter, FSDP has to ignore the parameter state; otherwise FSDP does not know how to store the state.
This PR adds the logic to ignore parameters that are not managed by FSDP and are not on the rank.
Differential Revision: [D42982778](https://our.internmc.facebook.com/intern/diff/D42982778/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94129
Approved by: https://github.com/rohan-varma
List all missing ops rather than terminating early.
Test on device
Logcat lists all operators:
```
12-06 00:23:36.523 8299 8299 F DEBUG : Abort message: 'terminating with uncaught exception of type c10::Error: Following ops cannot be found: [aten::max_pool2d, aten::conv2d]. Please check if the operator library is included in the build. If built with selected ops, check if these ops are in the list. If you are a Meta employee, please see fburl.com/missing_ops for a fix. Or post it in https://discuss.pytorch.org/c/mobile/ ()
12-06 00:23:36.523 8299 8299 F DEBUG : Exception raised from initialize_operators at xplat/caffe2/torch/csrc/jit/mobile/function.cpp:89 (most recent call first):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94205
Approved by: https://github.com/JacobSzwejbka
This PR will prevent a crash in `test_output_match_nan_to_num_cpu_float16`, that would otherwise happen with the upcoming updates to MPS Framework in Ventura (in API `logicalANDWithPrimaryTensor()`). The fix is backwards compatible with Monterey too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94220
Approved by: https://github.com/malfet
- Fix correctness issues with nll_loss_backward(), smooth_l1_loss_backward() and cross_entropy_backward() by taking grad_output into account when computing those loss ops
- Add numel()==0 check to prevent crashes
- Clean up and formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94226
Approved by: https://github.com/kulinseth
This reverts commit f3bf46e801dec2637751224fd6e27fbf97453bc6.
Reverted https://github.com/pytorch/pytorch/pull/94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk. For example, 05397b1250
Despite my initial attempt to clean up MacOS runner as best as I could (https://github.com/pytorch/test-infra/pull/2100, https://github.com/pytorch/test-infra/pull/2102), the runner in question `i-09df3754ea622ad6b` (yes, the same one) still had its free space gradually dropping from 10GB (after cleaning conda and pip packages few days ago) to only 5.2GB today: 4207d3c330
I had a gotcha moment after logging into the runner: the direct root cause was right before my eyes. I had forgotten to look at the processes running there:
```
501 7008 1 0 13Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3912838018 --no-capture-output python3 -m tools.stats.monitor
501 30351 30348 0 18Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3953492510 --no-capture-output python3 -m tools.stats.monitor
501 36134 36131 0 19Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3956679232 --no-capture-output python3 -m tools.stats.monitor
501 36579 36576 0 Mon11PM ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4048875121 --no-capture-output python3 -m tools.stats.monitor
501 37096 37093 0 20Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3971130804 --no-capture-output python3 -m tools.stats.monitor
501 62770 62767 0 27Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4025485821 --no-capture-output python3 -m tools.stats.monitor
501 82293 82290 0 20Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3969944513 --no-capture-output python3 -m tools.stats.monitor
501 95762 95759 0 26Jan23 ttys001 0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4012836881 --no-capture-output python3 -m tools.stats.monitor
```
There were many leftover `tools.stats.monitor` processes there. After pkill-ing them all, an extra 45GB of free space was immediately freed up. The same situation could be seen on other MacOS pet runners too, e.g. `i-026bd028e886eed73`.
At the moment, it's unclear to me what edge case could cause this, as the step to stop the monitoring script should always be executed; maybe it received an invalid PID somehow. However, the safety-net catch-all solution would be to clean up all leftover processes on MacOS pet runners before running the workflow (similar to what is done on Windows in https://github.com/pytorch/pytorch/pull/93914).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94127
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
We've been seeing linter failures when the `apt-get install doxygen` command fails to install due to network errors, and the workflow doesn't get retried since it's in a non-retryable step
This PR moves it to a retryable step
It also marks a deterministic step as nonretryable, since retrying that one will never change the output
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94199
Approved by: https://github.com/huydhn, https://github.com/malfet
Since `.data` creates a new Tensor and thus a new python object, this check compares the ids of temporary objects and thus always succeeds given the current behavior of python's allocator:
```
>>> import torch
>>> print(id(torch.rand(2)) == id(torch.rand(3)))
True
```
I change it here to make sure they look at the same memory.
If you want to check that they are the same python object, I can change it to `is`. Let me know!
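For illustration, a sketch of the difference between the two checks being discussed (not the test code itself):
```python
import torch

t = torch.rand(2)

# id() compares transient objects: each .data access builds a fresh Python
# object, and the first one is freed before the second is allocated, so the
# ids tend to coincide even though the comparison is meaningless.
print(id(t.data) == id(t.data))

# Comparing the underlying storage pointers is the robust "same memory" check.
print(t.data.data_ptr() == t.data_ptr())  # True: both view the same storage
```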
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94097
Approved by: https://github.com/malfet
We now always have a `__getstate__`/`__setstate__` pair AND the `__dict__` attribute is lazily initialized. So we need to support that in our serialization code.
A quick audit of the rest suggests that the new `__getstate__` is not too problematic. But maybe the test suite will bring more things to light.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94096
Approved by: https://github.com/ezyang, https://github.com/malfet
…g scalars
Fixes#93784, #93225
Ideally, the clamp decomp should live in refs or _decomp, but this would reverse our current decomposition flow of `clamp_min` -> `clamp` -> lowering, so to keep changes to a minimum, I'm leaving it in inductor for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94157
Approved by: https://github.com/ezyang
By moving guard string assembly into dynamo's default behavior and letting code_parts do the work, we can have much better shape guard failures.
Before this fix, the guard failure in the test would look like:
```
'x.size()[1] == x.size()[0] and x.stride()[0] == x.[264 chars]!= 1' != 'x.size()[0] < 3'
- x.size()[1] == x.size()[0] and x.stride()[0] == x.size()[0] and x.stride()[1] == 1 and x.storage_offset() == 0 and y.size()[0] == x.size()[0] and y.size()[1] == x.size()[0] and y.stride()[0] == x.size()[0] and y.stride()[1] == 1 and y.storage_offset() == 0 and x.size()[0] < 3 and x.size()[0] != 0 and x.size()[0] != 1
+ x.size()[0] < 3
```
now it is
```
"x.size()[0] < 3"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93894
Approved by: https://github.com/ezyang
This allows unet to be compiled with symbolic shapes (but it still fails accuracy, lol).
Output sizes are always integers; there's no need to pretend they are ever float. Recomputing scale factors still used nominally float sizes converted to int, so we might as well do it from the start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94123
Approved by: https://github.com/ezyang
We greatly simplify the handing of OpenMP in CMake by using caffe2::openmp target thoroughly. We follow the old behavior by defaulting to MKL OMP library and detecting OMP flags otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91576
Approved by: https://github.com/malfet
Summary:
Followup after https://github.com/pytorch/pytorch/pull/93267
Generated by running:
```
for i in *.cu; do sed -i -e "s/constexpr char/CONSTEXPR_EXCEPT_WIN_CUDA char/" $i; done
```
Otherwise, attempts to compile using VS-15.9 results in:
```
D:\pytorch\aten\src\aten\native\cuda\laguerre_polynomial_l.cu(17): fatal error C1001: An internal error has occurred in the compiler.
(compiler file 'msc1.cpp', line 1518)
To work around this problem, try simplifying or changing the program near the locations listed above.
Please choose the Technical Support command on the Visual C++
Help menu, or open the Technical Support help file for more information
Internal Compiler Error in D:\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64\cl.exe. You will be prompted to send an error report to Microsoft later.
INTERNAL COMPILER ERROR in 'D:\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64\cl.exe'
Please choose the Technical Support command on the Visual C++
Help menu, or open the Technical Support help file for more information
```
Test Plan: CI
Differential Revision: D43011140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94091
Approved by: https://github.com/seemethere
There are some occurrences when clang-tidy linter fails flakily with the following error, which is very weird:
```
>>> Lint for FILE:
Error (CLANGTIDY) command-failed
Failed due to FileNotFoundError:
[Errno 2] No such file or directory: '.lintbin/clang-tidy'
```
For examples,
* 0a93e6db5a
* 203b2cad3e
The binary is definitely there, as the log shows that it has been downloaded successfully from S3. Looking a bit closer, I notice that the linter uses `os.chdir` to jump around between the workspace and the build folder, and it also refers to the binary with the relative path `.lintbin/clang-tidy`, which doesn't exist in the latter. AFAIK, the current working directory is per process (https://stackoverflow.com/questions/16388400/what-is-a-thread-specific-os-chdir-and-mkdir-in-python), so I suspect that there is a race here where one thread chdirs into build while another thread tries to lint another file. Thus the fix is to use the absolute path to clang-tidy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94093
Approved by: https://github.com/malfet
currently the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```
has errors
```
Traceback (most recent call last):
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
These backends have been broken for some time. I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12. If I manually
bypass that downgrade, I get import errors from inside fx2trt. Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
Before:
```
(/home/ezyang/local/a/pytorch-env) [ezyang@devgpu020.ftw1 ~/local/a/pytorch (ab0e3db0)]$ python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:54.19504 backend_compile:33.86702
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:72549 | FakeTensorMode.__torch_dispatch__:115542 | ProxyTorchDispatchMode.__torch_dispatch__:3103
```
After
```
(/home/ezyang/local/a/pytorch-env) [ezyang@devgpu020.ftw1 ~/local/a/pytorch (ab0e3db0)]$ python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18
cuda eval hrnet_w18 PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```
It doesn't really help end-to-end wall time all that much, but it does cut the number of calls to FakeTensor.__torch_dispatch__ by an order of magnitude, which hopefully has other positive effects.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93946
Approved by: https://github.com/eellison, https://github.com/albanD
`isposinf` and `isneginf` currently fall back in inductor. Here, I
enable the existing decompositions to work with inductor.
`isinf` can also be written with aten functions, however I don't add
it to inductor's decompositions because `isinf` is lowered to
`tl.libdevice.isinf` in triton.
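For reference, a hedged sketch of how these can be expressed with plain aten ops (not necessarily the exact registered decompositions):
```python
import torch

def isposinf_decomp(x):
    return x == float("inf")

def isneginf_decomp(x):
    return x == float("-inf")

x = torch.tensor([1.0, float("inf"), float("-inf")])
assert torch.equal(isposinf_decomp(x), torch.isposinf(x))
assert torch.equal(isneginf_decomp(x), torch.isneginf(x))
```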
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93951
Approved by: https://github.com/lezcano
This is a common issue when parallelizing with `TensorIterator`: if the problem size is described as [M, N, K] and only [M, N] is reflected in the TensorIterator (with K being folded), `grain_size` should also be divided by K.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94025
Approved by: https://github.com/XiaobingSuper
Two small changes that I'm bundling together because one of them needs to touch fbcode and I'm not sure how to do stacked diffs + internal changes + land before release cut.
Remove allow_meta from the ctor, and allow it by default: we should be able to trace through meta with fake tensors, so in some sense it's a bit weird to expose an option to the user to disallow this. However, it's still useful debug-wise to error from time to time, so I've added an option to the config that restores the previous behavior.
Remove `throw_on_data_dependent_ops=True`: this was intended as temporary behavior while we were smoothing over turning on the erroring. I could not find any uses of `throw_on_data_dependent_ops=False` anywhere.
These are technically backward-incompatible, but fake tensor is new since the last release / in a private namespace, and I don't want to release it with baggage that would be hard to remove later.
Fix for https://github.com/pytorch/pytorch/issues/92877.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93993
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
It's not available as a system dependency, so assume that it is installed using Anaconda.
Also, clang on MacOS does not recognize the `-fopenmp` flag, but according to https://mac.r-project.org/openmp/ and local experiments, `-Xclang -fopenmp` always works.
Test plan:
Following should run and return true
```python
import torch
def foo(x: torch.Tensor) -> torch.Tensor:
return torch.sin(x) + torch.cos(x)
if __name__=="__main__":
x = torch.rand(3, 3)
x_eager = foo(x)
x_pt2 = torch.compile(foo)(x)
print(torch.allclose(x_eager, x_pt2))
```
Skip a number of tests that fail on x86 MacOS (for example rsqrt for the bool type, and `test_pixel_shuffle_channels_last_cpu` on machines that do not support AVX2).
Tweak a few tests to use double precision when running on CPU, as type promotion for accumulator types is broken.
TODO: Fix PyTorch for M1 compilation with OpenMP, bundle `omp.h` into the package and use it instead.
Fixes https://github.com/pytorch/pytorch/issues/90362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93895
Approved by: https://github.com/jansel, https://github.com/jgong5
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.
This fixes it by running a separate subprocess for each branch to make sure you are not affected by the run for the other branch.
Also added a couple more significant figures
to the numbers in the summary table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
This PR abstracts some reduction utils on CPU, which can be shared by multiple reduction operators, such as `scatter_reduce`, `segment_reduce`, `spmm_reduce`.
No functional change or performance change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92284
Approved by: https://github.com/ezyang
Summary:
Adds a compare weights NS API using a single model.
Note: this is not intended for wide usage, so testing is limited
to specific functions our customers care about. The main reason for adding this
is because existing customers of NS are using the old `compare_weights` API,
and we'd like to move everyone to a single-model API style.
Once all the customers are moved over, we can delete all the old NS code.
Test plan:
```
python test/test_quantization.py -k NShadows.test_extract_weights_linear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92058
Approved by: https://github.com/jerryzh168
Summary:
This PR reimplements the old `add_loggers(name_a, model_a, name_b, model_b)`
API in a single-model API style, similar to PNP. This allows for memory
efficiency savings of not having to load two models.
Test plan:
```
python test/test_quantization.py -k NShadows.test_add_loggers
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91639
Approved by: https://github.com/jerryzh168
## `pip install -r requirements.txt` in build-from-source documentation
This line
81b5eff3c3/README.md (L182-L188)
is outdated. Let's default to `requirements.txt`.
### My problem
Not having touched this codebase for years, I'm trying to build the repo for local development and run unit tests. I go to `build from source => Contributing.md` and immediately run into various problems.
* [Contributing.md](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#developing-pytorch) suggests one way of setting up environment different from [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) that does not work for me.
* [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) suggests a different set of dependencies than [`requirements.txt`](https://github.com/pytorch/pytorch/blob/master/requirements.txt), many of which are unnecessary, while some needed to run the unit tests are still missing.
* Dependencies in `requirements.txt` are needed to run unit tests
So there are competing, inlined, outdated, yet equally confident recommendations on how to set up. https://github.com/pytorch/pytorch/pull/91850 tries to remove one recommendation; this PR tries to make the default one simpler.
### Goals
* Improve society somewhat 😁
* Remove a dead end roundtrip in the developer onboarding funnel
* Update a duplicated & outdated line of documentation
* Two broken things => one broken thing
* Improve doc maintainability and nudge us to a productive discussion of what `requirements.txt` is there for.
### Non-goals
* Give a definitive recommendation on how to set up your machine for local development. I read the instructions in the README at this moment as an outline of how to do it.
* Say that `requirements.txt` is a definitive guide to dependencies; I know it's not (but it probably should be).
### Background
* Dependency handling/reproducibility in this repo is tricky! See the gist of [this](fdbbd20f32/.github/requirements/README.md). There are many different sets of dependencies with different setups for different environments.
* There have been great attempts at _"one requirements.txt to rule them all"_ which got halted (https://github.com/pytorch/pytorch/pull/60697/, see https://github.com/pytorch/pytorch/issues/61375).
* The unofficial `requirements.txt` file seems to be .circleci/docker/requirements-ci.txt https://github.com/pytorch/pytorch/issues/72556
* Unofficial _"how to build from source"_ docs seem to be here https://github.com/pytorch/pytorch/tree/master/.circleci#how-to-build-a-binary-locally
### Considered alternatives
* a) Point only to python dependencies in `requirements.txt` **(Chosen option)**
```
conda install cmake ninja
pip install -r requirements.txt
```
This guarantees that `python setup.py` runs (on my machine) and gets me one step closer to being able to run `python test/run_test.py`.
* b) Only add what's needed for `python setup.py install`. Point to `Contributing.md` for explanations of how to run tests (which it doesn't exactly mention how to do yet).
```
conda create -n pytorch-source python cmake ninja pyyaml typing_extensions
conda activate pytorch-source
python setup.py develop
```
* c) Add dependencies needed to run (most) unit tests
I assume _"Install from source"_ describes how to "install so I can do development.". This is why we recommend `python setup.py develop`. Doing development implies running unit tests.
```
conda create -n pytorch-source python cmake ninja pytest click
conda activate pytorch-source
pip install -r requirements.txt xdoctest
python setup.py develop
python test/run_test.py --keep-going
```
This still eclectically goes outside the simple principle _"use the dependencies in requirements.txt"_ without solving the whole problem. Instructions to get tests to run are not the goal of this PR.
* d) Point to ex [`.circleci/docker/requirements-ci.txt`](https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt) or any of the system-specific sets of pinned requirements like [`requirements-{conda-env-macOS-ARM64}.txt`](https://github.com/pytorch/pytorch/blob/master/.github/requirements/conda-env-macOS-ARM64)
I don't want to jump into this rabbit hole.
<details>
<summary>My system according to setup.py when verifying it runs</summary>
```
Target system: Darwin-21.6.0
Target processor: arm64
Host system: Darwin-21.6.0
Host processor: arm64
Detected C compiler: AppleClang @ /Library/Developer/CommandLineTools/usr/bin/cc
CMake: 3.22.1
Make program: /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/bin/ninja
Python version : 3.10.8
Python executable : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/bin/python
Pythonlibs version : 3.10.8
Python library : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/lib/libpython3.10.a
Python includes : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/include/python3.10
Python site-packages: lib/python3.10/site-packages
```
</details>
See details in comments below.
[skip ci]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91861
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
This is really hard to debug; the faulty runner had already disappeared by the time I tried to log in. However, I figured out a way to get all the processes that could potentially hold the workspace by running:
```
choco install sysinternals -y
handle64.exe C:\actions-runner\_work\pytorch\pytorch\test\test-reports\
```
This gives me a better list of processes to kill.
```
PS C:\Windows\system32> handle64.exe C:\actions-runner\_work\pytorch\pytorch\test\test-reports\
Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com
python.exe pid: 1672 type: File 574: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
python.exe pid: 4604 type: File 6C8: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
python.exe pid: 4604 type: File 6CC: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
ninja.exe pid: 4764 type: File 468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
ninja.exe pid: 4764 type: File 5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cl.exe pid: 5336 type: File 468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cl.exe pid: 5336 type: File 5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
nvcc.exe pid: 1680 type: File 468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
nvcc.exe pid: 1680 type: File 5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cmd.exe pid: 976 type: File 468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cmd.exe pid: 976 type: File 5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
```
Crossing my fingers to have this working
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93914
Approved by: https://github.com/clee2000
Fixes#92676
`arange` infers the output dtype from the argument types, but in order to reduce
falling back to ATen, inductor preferred to cast whole number float arguments to
int which gave the wrong output dtype. Instead, this decomposes floating point
arange into the prim equivalent for integers.
This also changes the signature of `prims.arange` to
```python
prims.iota(length, *, start, step, **factory_kwargs)
```
which only supports integer arguments. This is done because calculating the
output size from `start, end, step` is surprisingly complex and liable to off-by-one
errors, so it should not be duplicated in each backend.
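As a rough, hypothetical sketch of the idea (not the actual decomposition, and glossing over the off-by-one edge cases mentioned above), the floating-point case reduces to an integer iota plus pointwise ops:
```python
import math
import torch

def arange_via_iota(start: float, end: float, step: float) -> torch.Tensor:
    # The length is computed once here, at the decomposition level, so each
    # backend only needs an integer "iota" primitive (torch.arange(length)
    # stands in for prims.iota in this sketch).
    length = max(math.ceil((end - start) / step), 0)
    idx = torch.arange(length)
    return start + idx * step

print(arange_via_iota(0.0, 1.0, 0.25))  # tensor([0.0000, 0.2500, 0.5000, 0.7500])
print(torch.arange(0.0, 1.0, 0.25))     # same values
```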
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93353
Approved by: https://github.com/ngimel, https://github.com/lezcano
Fixes#93351
The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values
```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```
The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
Fixes#93391
Thank you to the PyTorch Distributed team for your invaluable contributions to the PyTorch ecosystem, your work is immensely impressive and inspiring!
As mentioned in #93391, in preparing the downstream package I maintain ([finetuning-scheduler](https://github.com/speediedan/finetuning-scheduler)) to support PyTorch 2.0's version of FSDP, I noticed modules that include multiple persistent buffers were not having their state properly transformed during saving of `state_dict`s.
The issue was that the post-state_dict hook codepath shared by the `FULL_STATE_DICT` and `SHARDED_STATE_DICT` `_state_dict_type`s ([`_common_unshard_post_state_dict_hook`](332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L158))) was inadvertently referencing a local variable (`buffer`) that was used in a [prior transformation](332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L231)), instead of the `buffers` variable that should have been referenced in the iteration context:
332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L251-L253)
In this case, modules with a single persistent buffer or without mixed precision enabled would be unaffected. With multiple buffers and mixed precision enabled however, the issue may appear stochastically in proportion to the ratio of persistent buffers that have compatible dimensions (since the value of the last buffer visited in the ``buffer_names`` ``Set`` is copied to all buffers and the ``Set`` iteration order will of course vary)
```bash
File ".../pytorch/torch/nn/modules/module.py", line 2028, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
size mismatch for _fsdp_wrapped_module.1._fsdp_wrapped_module.running_mean: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([10]).
```
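For illustration only, here is a minimal, self-contained sketch of the stale-variable pattern (not FSDP's actual code; all names are illustrative):
```python
import torch

buffers = {
    "running_mean": torch.zeros(3),
    "running_var": torch.ones(3),
}

# An earlier transformation loop (e.g. a mixed-precision cast) leaves `buffer`
# bound to whichever entry happened to be visited last.
for name, buffer in buffers.items():
    buffer = buffer.to(torch.float32)

state_dict = {}
for name in buffers:
    # Buggy pattern: referencing the stale loop variable `buffer` instead of
    # buffers[name] copies the last-visited buffer's value into every entry.
    state_dict[name] = buffer.clone()

print(state_dict)  # both entries hold the value of the last buffer visited above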
To both address this issue and enhance coverage to avoid similar issues, this PR fixes the aforementioned typo and adds an additional set of basic tests that validate `state_dict` saving and loading for modules with persistent buffers in various contexts.
I found that adding another model along with additional buffer-specific logic to adapt [`test_basic_save_and_load_state_dict`](76b683b008/test/distributed/fsdp/test_fsdp_state_dict.py (L439)) for the purposes of this coverage would increase the complexity of that test to an undesirable degree.
Instead of adding additional complexity to that existing test, I've added a new test ``test_buffers_save_and_load_state_dict`` that does basic validation of ``state_dict`` saving and loading with mixed precision, ``state_dict_type``, and CPU offloading parameterization. Certainly let me know if you'd prefer I extend the logic of (or add the persistent-buffers model into) the existing basic ``state_dict`` test; I'm happy to do so, I just thought this was cleaner. Also, I thought doubling the number of tests with a ``use_orig_params`` parameterization, or by testing additional non-default buffer mixed-precision data types, was computationally imprudent, but let me know if you'd like me to add those tests as well.
The only other notable test change is that I've refactored ``TestFSDPStateDict._compare_models`` to accommodate both ``buffers`` and ``parameters`` comparisons without code duplication.
Thanks again to the PyTorch Distributed team for your exceptional contributions. I've got some more to do adapting my package for 2.0's FSDP but it's been a delight so far thanks to your superlative work!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93396
Approved by: https://github.com/rohan-varma, https://github.com/awgu, https://github.com/fegin
Fixes#93019
Since PyTorch regularly breaks binary compatibility, `torchvision` must be
compiled with the exact same version of PyTorch. If not, then importing it may
cause mysterious failures at runtime due to binary incompatibility.
This fixes the issue by delaying the `make_fallback` call for
`torchvision.roi_align` until the operator appears in a graph being lowered, by
which point the user must have imported torchvision themself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93027
Approved by: https://github.com/jansel
**Overview**
- This PR refactors the `summon_full_params()` unit tests to prepare for `unshard_params()` by consolidating redundant tests and improving others.
- This PR enables `CPUOffload(offload_params=True)` + `NO_SHARD` + `writeback=True`.
- This PR provides an improved error message when calling `summon_full_params()` from an invalid context (i.e. from forward, backward, or from within another `summon_full_params()` call). A minimal usage sketch of the API under test is shown below.
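For reference, a minimal usage sketch of `summon_full_params()` (a sketch only; it assumes `torch.distributed.init_process_group(...)` has already been called and CUDA is available, and the `nn.Linear` module is just a placeholder):
```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes the process group is already initialized on this rank.
model = FSDP(torch.nn.Linear(8, 8).cuda())

with FSDP.summon_full_params(model, writeback=False, rank0_only=False):
    # Inside the context, the original unsharded parameters are visible.
    for name, param in model.named_parameters():
        print(name, tuple(param.shape))
```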
**Details**
<details>
<summary>Existing Unit Tests</summary>
`test_summon_full_param_writeback()` with `world_size=1`
`test_summon_full_param_writeback()` with `world_size=2`
- Tests that `writeback=True` persists write and that `writeback=False` does not persist write when modifying a root FSDP instance's `flat_param` (`modify_outer=True`) or a non-root FSDP instance's `flat_param` (`modify_outer=False`); additionally configures with `mixed_precision` and `use_orig_params`
- `CPUOffload(offload_params=True)` + `world_size=1` is not tested because it is not supported.
- The write inside `summon_full_params()` is on the `flat_param` itself, which is not the expected usage.
`test_summon_full_param_shard_value()`
- Tests that reconstructing the `flat_param` (by re-flattening and chunking parameters) inside `summon_full_params()` gives the same as the originally constructed `flat_param` when using a single FSDP instance
- This test seems to exercise the FSDP sharding algorithm, not the specification of `summon_full_params()`. The only relevant part being implicitly tested is that `model.parameters()` order is preserved.
- This test assumes the current FSDP sharding algorithm.
`test_summon_full_param_recursive()`
- Tests that `recurse=True` recursively applies to all FSDP instances and that `recurse=False` does not
- This test assumes the current FSDP sharding algorithm.
`test_cannot_summon_full_params_from_forward()`
`test_cannot_summon_full_params_from_backward()`
- Tests that calling `summon_full_params()` from inside the forward or backward raises an error
- The error message leaks `FlatParamHandle` to the user. I provided a better error in this PR.
`test_summon_full_params_respects_reshard_after_forward()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`)
- This test depends on FSDP internals (`flat_param._full_param_padded.storage().size()`).
`test_summon_single_param()`
- Tests that writing to padding with `writeback=True` does not persist those writes (doing so by using a singleton `(1, 1)` parameter that gets flattened and padded to `(2,)`)
- This test name is misleading.
`test_summon_full_params_equivalence()`
- Tests `writeback`, `rank0_only`, and `offload_to_cpu` with `writeback=not rank0_only`, using `CPUOffload(offload_params=True)` and including a `torch.cuda._sleep(int(1e6))` _after_ the write in `summon_full_params()`
- The PR introducing this test said that the `torch.cuda._sleep(int(1e6))` exercised the stream synchronization in `summon_full_params()`--namely that the current stream waits for the all-gather stream after all-gathering the parameters. I did not follow conceptually how that works since the `torch.cuda._sleep()` call happens after both the all-gather and write and is in the default stream, which seems to be after the relevant ops. If we clarify this, I can re-incorporate this into the unit tests. Doing so is not a high priority since `summon_full_params()` unshards in the default stream now and does not require stream synchronization.
- This unit test has overlap with `test_summon_full_param_writeback()` and can be coalesced.
`test_summon_from_non_fsdp()`
- Tests calling `summon_full_params()` with default args on a non-FSDP root module exposes the original parameters correctly
- This test actually covers much of the specification since checking for original parameter equivalence includes shape, value, device, etc. checking.
`test_reshard_outside_forward_backward_iteration()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`) and that calling `summon_full_params()` after backward preserves that the padded unsharded `flat_param` data are freed; additionally configures `mixed_precision`
- This test strictly dominates `test_summon_full_params_respects_reshard_after_forward()` in strictness since it includes the check after backward as well.
`test_params_are_unflattenned()`
- Tests that original parameters are exposed with the unflattened shape factoring in `rank0_only` (e.g. including that nonzero ranks reshard early when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`
`test_params_count_and_value()`
- Tests that original parameters are all exposed and with the correct values factoring in `rank0_only` (e.g. including that nonzero ranks do not expose the original parameters when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`
`test_raises_rank0_with_writeback()`
- Tests that `rank0_only` + `writeback=True` raises an error
`test_named_parameters_buffers()`
- Tests that `named_parameters()` and `named_buffers()` return clean names (without FSDP prefixes) inside `summon_full_params()`
`test_with_grads_core()`
- Tests `with_grads=True` by comparing against DDP
`test_with_grads_none_grads()`
- Tests `with_grads=True` when ranks' `FlatParameter`s have `None` gradient
</details>
<details>
<summary>New Unit Tests</summary>
`test_unshard_params_writeback_no_shard()` (with `world_size=1`)
`test_unshard_params_writeback()` (with `world_size=2`)
- Tests the `writeback` argument (using the default value for all others)
`test_unshard_params_param_data_no_shard()` (with `world_size=1`)
`test_unshard_params_param_data()` (with `world_size=2`)
- Tests that parameters are exposed correctly for `recurse=True` and all other argument configs for a non-FSDP root module
`test_unshard_singleton_param_writeback()`
- Tests `writeback=True` for a singleton parameter, which includes testing that writing to padding does not persist
`test_unshard_params_respects_reshard()`
- Tests that unsharding parameters respects the expected reshard behavior between forward and backward as well as after backward
`test_unshard_params_recurse()`
- Tests the `recurse` argument (using default for all others)
`test_offload_to_cpu_no_shard_raises()`
- Tests that `offload_to_cpu=True` with `NO_SHARD` raises an error
</details>
<details>
<summary>Summary of Unit Test Changes</summary>
- `test_summon_full_param_writeback` -> `test_unshard_params_writeback()`
- `test_summon_full_params_equivalence()`, `test_params_are_unflattenned()`, `test_params_count_and_value()` -> `test_unshard_params_param_data()`
- `test_summon_full_params_respects_reshard_after_forward()`, `test_reshard_outside_forward_backward_iteration()` -> `test_unshard_params_respects_reshard()`
- `test_summon_full_param_recursive()` -> `test_unshard_params_recurse()`
- `test_named_parameters_and_buffers()` unchanged
- `test_with_grads_core()` unchanged
- `test_with_grads_none_grads()` unchanged
- `test_cannot_summon_full_params_from_forward()`, `test_cannot_summon_full_params_from_backward()` -> `test_unshard_params_from_forward_raises()`, `test_unshard_params_from_backward_raises()`
- `test_raises_rank0_with_writeback()` -> `test_rank0_only_with_writeback_raises()`
- `test_offload_to_cpu_no_shard_raises()` new
- `test_summon_full_param_shard_value()` removed
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92298
Approved by: https://github.com/rohan-varma
**Overview**
This PR stack will add support for unsharding FSDP's sharded parameters for `fully_shard`. This PR takes the first step by doing some internal refactoring.
- The existing API for wrapper FSDP is the static method `summon_full_params()`, which calls into the helper `_summon_full_params()`.
- This PR refactors:
- `summon_full_params()` core logic to `_unshard_params()`
- `_summon_full_params()` to `_unshard_params_recurse()`, which has a `recurse: bool` argument
- Previous `_unshard_params()` to `_unshard_fsdp_state_params()`, which applies to a single FSDP state
**Details**
- This PR introduces `_get_fsdp_states_with_modules()` and `_get_root_fsdp_states_with_modules()`, which additionally return the modules along with the FSDP states. The modules are needed for handling `FlatParameter` registration.
- We may be able to remove this if we clean up the `use_orig_params=True` vs. `False` code paths because for `True`, the `FlatParameter` is not registered, meaning that it does not need to be de-registered.
- Since `fully_shard` requires `use_orig_params=True`, we may not need `_get_fsdp_states_with_modules()` and `_get_root_fsdp_root_modules()`; however, I prefer to make the separation of FSDP state and module explicit for now for clarity.
**Follow-Ups**
- `writeback=True` and `rank0_only=True` raises an error. The previous explanation was:
> is not supported, as model parameter shapes will be different across ranks, and writing to them can lead to inconsistencies across ranks when the context is exited.
I am not exactly sure what the different model parameter shapes refers to. However, I believe that we can support `writeback=True` and `rank0_only=True` by broadcasting the `FlatParameter` from rank 0 in the `finally`, writing back, and freeing. This should not increase the peak memory since rank 0 already holds the unsharded `FlatParameter` in GPU memory before writing back and nonzero ranks do not have any other unsharded `FlatParameter`s in GPU memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92297
Approved by: https://github.com/rohan-varma
# Summary
This PR creates `_flash_attention_backward` and `_scaled_dot_product_flash_attention_backward` native functions and registers them in the respective derivatives.yaml.
The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch. One thing we don't have access to in native PyTorch is ctx.save_for_backward, so in order to save these variables I extended the objects returned from the forward functions.
### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs, and added a meta registration for backwards. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error below.
### Questions?
Performance issues vs. the memory-efficient kernel when using torch.nn.mha_forward.
TorchCompile -> see the proposed solution below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
We sometimes put ShapeEnv on GraphModule, and code in our testing
utils assumes that you can deepcopy a GraphModule, so it's good
for ShapeEnv to be deepcopy'able too. This is done by making the
TLS module-wide rather than per-ShapeEnv. We never really have
multiple ShapeEnvs, so this is a good trade-off.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93403
Approved by: https://github.com/jbschlosser
Fixes https://github.com/pytorch/pytorch/issues/61655
The test is flaky and fails whenever `test_jit_cuda_archflags` is run. The latter, `test_jit_cuda_archflags`, was a slow test on the old Windows runner. It's currently running again on trunk due to the problem with populating the slow-test JSON file. ~Interestingly, its performance is getting better on the new Windows G5 runner and it becomes a borderline slow test, where it runs sometimes.~ Whenever it runs, the next test, `test_jit_cuda_extension`, will fail.
* Build and load the different CUDA arch modules from `test_jit_cuda_archflags` in separate processes to avoid importing them into the current one. The test only checks the build artifacts. Importing them causes `test_jit_cuda_extension` to fail as described in https://github.com/pytorch/pytorch/issues/61655
* Clean up the temp build dir on Windows. Windows CUDA runner is non-ephemeral, so it's better to clean thing up properly to avoid any funny business the next time the runner is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93332
Approved by: https://github.com/davidberard98
Fixes https://github.com/pytorch/pytorch/issues/89421
The strategy is to patch the given function wrapped with `@torch.fx.wrap` so that if a tensor tracer is active, we `proxy_call` the function.
`proxy_call` will also skip certain checks if the function to proxy-call is not a torch op (checked with `isinstance(..., OpOverload)`).
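For background, a minimal example of `@torch.fx.wrap` under ordinary FX symbolic tracing, where the wrapped function is kept as a leaf call rather than traced through (this sketch uses plain `symbolic_trace`, not the proxy tensor tracer the fix targets; `add_scaled` is a made-up function):
```python
import torch
import torch.fx

@torch.fx.wrap
def add_scaled(x, scale: float = 2.0):
    # Kept as a single call_function node in the traced graph.
    return x * scale + 1.0

def f(x):
    return add_scaled(torch.relu(x))

gm = torch.fx.symbolic_trace(f)
print(gm.graph)           # shows a call_function node targeting add_scaled
print(gm(torch.ones(3)))  # tensor([3., 3., 3.])
```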
@IvanYashchuk @ezyang @Chillee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93273
Approved by: https://github.com/ezyang
We used to have ASAN shards 4 and 5 running on 4xlarge because they timed out. With the current issue with test time collection, I guess the shard allocation has changed, and there are now timeouts from shards 1 to 3. It's better to just have all shards use the same runner for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93879
Approved by: https://github.com/clee2000
**Background:** Before this PR, support in dynamo for tensor attributes (e.g. `x.H`, `x.T`, ...) needed to be implemented individually, one by one. This could potentially lead to errors, e.g. if the implementation in [variables/tensor.py](21c7c7c72f/torch/_dynamo/variables/tensor.py (L160)) differs from the implementation from a direct call to the attribute. For attributes that were not special-cased in tensor.py, dynamo tracing would fail. This PR adds generic support for tensor attributes that return tensors without needing to handle them specially. (Notably, for x.real and x.imag, which previously weren't supported.)
**In this PR:** This directly creates a proxy node for a `"call_function"` node with `target=getattr`, and feeds it into wrap_fx_proxy. This will produce a TensorVariable for the attribute returned.
This also removes the implementations for H, T, mH, mT which were broken (previously `torch.relu(x.T)` would fail). They now fall back to this default implementation (for which `torch.relu(x.T)` passes).
**Further context**:
* Ed's original suggestion in [90463](https://github.com/pytorch/pytorch/pull/90463#discussion_r1043398340) is to use `torch.Tensor.H.__get__(x)`. I wasn't able to get this to work; fx compilation fails with `getset_descriptor does not have attribute __module__`. Basically, the `__module__` attribute, which is available on most python attributes, is not available on `getset_descriptor` objects. (i.e., these are implemented in C++ as attributes on torch.Tensor, so they don't obey some assumptions made by fx)
* Although both tensor attributes and methods (like `x.relu()`) go through this, this PR should only handle attributes (e.g. see the `"getset_descriptor"` check in variables/tensor.py). Methods are already handled by GetAttrVariable.
* Prior to this PR, we already returned GetAttrVariables for unsupported attrs: the parent caller would catch the NotImplementedError and fallback to returning a GetAttrVariable. But if this GetAttrVariable was ever passed into a torch.\* function (as it could quite possibly be, since most of these attrs are tensors), it would fail because its proxy node would be missing an [example_value](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/utils.py#L1017). So: before, for some tensor x, `x.real` would work fine; but `torch.relu(x.real)` would fail.
**Testing**: added tests in test_misc.py for x.real, x.imag, x.T, x.real.T.
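A small sanity-check sketch of the kind of code this enables (mirroring the added tests; illustrative only):
```python
import torch

def f(x):
    # Tensor attributes such as .T, .real, and .imag now trace through dynamo
    # via a generic getattr proxy rather than per-attribute special cases.
    return torch.relu(x.T) + x.real.T

x = torch.randn(4, 4)
compiled = torch.compile(f)
print(torch.allclose(f(x), compiled(x)))  # True
```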
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91840
Approved by: https://github.com/ezyang
Summary:
This PR adds support for LSTM Structured Pruning.
- Adds LSTMSaliencyPruner, an implemented pruner that splits the packed weights, finds the appropriate mask for each piece individually based on saliency, and then combines them to create an overall mask for the LSTM.
- Adds pruning functions for LSTM pruning, which will split the weights, apply the masks, and then recombine the pruned weights. Works for both single- and multiple-layer LSTMs.
Also added basic patterns to the default set of patterns for
LSTM -> Linear pruning
LSTM -> LayerNorm -> Linear pruning
Adds tests to check that LSTM pruning works, as well as for LSTMSaliencyPruner.
Test Plan:
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestSaliencyPruner.test_lstm_saliency_pruner_update_mask`
Differential Revision: [D42199001](https://our.internmc.facebook.com/intern/diff/D42199001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90801
Approved by: https://github.com/jerryzh168
Summary: `USE_CUDA` is needed in the bazel definitions to ensure that `USE_CUDA` is applied everywhere it should be.
We also fix some test code to use the correct properties.
Test Plan: Sandcastle
Differential Revision: D42616147
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92640
Approved by: https://github.com/ezyang
Summary:
Before this PR, PNP added shadow loggers to the insides of
the shadow wrapper modules.
This PR moves those loggers to the parent module.
There are a couple of benefits:
1. this will unbreak features of the quantization API which don't support loggers (such as hardcoding the model output to be quantized)
2. this makes it easier to look at the parent graph and visualize what is logged, since now all the logging is in the same graph
3. this will make it easier to implement features such as propagation error calculation in the future
Test plan:
```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91428
Approved by: https://github.com/jerryzh168
Summary:
Changes the PNP test cases to use QNNPACK. The only reason is that
I'm switching to a Mac M1 as my primary machine, which supports QNNPACK
but not fbgemm, and it's convenient for me to be able to run these
locally.
PNP itself is not backend specific, so it does not matter which backend
the functionality is tested on.
Test plan:
```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91421
Approved by: https://github.com/jerryzh168
upload_test_stats keeps failing because it can't handle the case where the id is `workflow-<workflow_id>`, so add a try/catch for this.
Add retries to get_workflow_job_id to try to reduce the number of times the id can't be found.
Failure to upload test stats and inability to get the job id make our sharding infra and slow-test infra (and probably also flaky-test detection) less effective. This does not completely resolve the issue since we do rely on the job id.
Failure to get the workflow job id happens tragically often, hopefully retries will help
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93401
Approved by: https://github.com/huydhn
https://github.com/pytorch/pytorch/issues/91536. One issue mentioned that torch.inv is pretty slow for large batches of small matrices on CUDA.
I checked the CPU implementation and found an optimization opportunity.
For torch.inv, the CPU path solves it via `lu_factor` + `lu_solve`.
`lu_factor` loops over the `batch_size` dimension and the parallelism happens inside LAPACK:
- For small matrices, the per-matrix work is too small to parallelize effectively inside LAPACK.
- Even for large matrices, the parallelization efficiency inside LAPACK is not good (it performs worse than using at::parallel outside).
- Only for small batch size + small matrix size does the OpenMP overhead become too large.
Based on the above observations, using at::parallel outside of `lu_factor` brings a pretty large benefit.
Here is the code/data collected on a 32-core ICX system.
```python
import torch
import time
def bench(bs, r):
    x = torch.randn(int(bs), r, r)
    start = time.time()
    for i in range(100):
        y1 = torch.linalg.lu_factor(x)
    end = time.time()
    print(r, bs)
    print(end - start)
    print((end - start) / (r**3))

for r in (4, 16, 64):
    for bs in (1e2, 1e4, 1e6):
        bench(bs, r)
```
| bs/rank | 100/4 | 10000/4 | 1000000/4 | 100/16 | 10000/16| 1000000/16| 100/64| 10000/64| 1000000/64|
| ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| parallel inside lapack | 0.0028 |1.077 | 11.99|0.0163 | 1.5260|153.17 |0.2021|20.93 | 1877|
| parallel outside lapack | 0.0087 | 0.0247 | 1.566| 0.0044|0.1678 |17.63|0.038|2.311 | 208.6|
|speed up ratio| 0.32x | 43.6x | 7.65x|3.70x |9.09x |8.69x |5.32x |9.06x |9x |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93037
Approved by: https://github.com/lezcano
`@torch.jit.unused` and `@torch.jit.ignore` do not allow keeping a member function that has a non-scriptable declaration (e.g. return type) in a TorchScripted class.
This adds the FunctionModifier `_DROP` to allow fully skipping such functions during scripting while keeping them in the code of the scripted class.
E.g. it can be used for:
```
@torch.jit._drop
def __fx_create_arg__(self, tracer: torch.fx.Tracer) -> torch.fx.node.Argument:
    # torch.fx classes are not scriptable
    return tracer.create_node(
        "call_function",
        CFX,
        args=(tracer.create_arg(self.features),),
        kwargs={},
    )

def __iter__(self) -> Iterator[torch.Tensor]:
    return iter(self.a)
```
Testing:
Added test case in `test/jit/test_types.py` with non-scriptable type annotations (fx.* classes) that fails before fix and passes after.
```
python test/test_jit.py
```
Differential Revision: [D42774830](https://our.internmc.facebook.com/intern/diff/D42774830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93012
Approved by: https://github.com/davidberard98
**Summary**
The X86 quantization backend (qengine) with oneDNN kernels has not been validated on OSes other than Linux. So, let it fall back to fbgemm if the OS is not Linux. This makes sure the behavior on Windows/Mac is the same as with the previous default fbgemm qengine on x86 CPUs.
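A quick sketch of how engine selection looks from Python (a sketch only; it assumes a build where the `x86` engine is listed among the supported engines):
```python
import torch

print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'onednn', 'x86', 'fbgemm']

# Selecting 'x86' on Windows/Mac now dispatches to fbgemm kernels under the hood,
# matching the previous default behavior on those platforms.
torch.backends.quantized.engine = "x86"
print(torch.backends.quantized.engine)
```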
**Test plan**
CI checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93218
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Fixes https://github.com/pytorch/pytorch/issues/93245
This failure started to happen recently. `tempfile.mkdtemp()` has already created the temporary directory, so removing it with `shutil.rmtree` and then recreating it with `os.makedirs` doesn't make much sense to me. The flaky part here is that `shutil.rmtree` can sometimes fail to remove the temporary directory. Here is the error:
```
======================================================================
ERROR [1.814s]: test_load_rowwise_to_colwise_thread_count_2 (__main__.TestDistributedReshardOnLoad)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 539, in wrapper
self._join_processes(fn)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 765, in _join_processes
self._check_return_codes(elapsed_time)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 810, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 663, in run_test
getattr(self, test_name)()
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
fn()
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 252, in instantiated_test
test(self, **param_kwargs)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
func(self, *args, **kwargs)
File "/var/lib/jenkins/workspace/test/distributed/checkpoint/test_file_system_checkpoint_cpu.py", line 364, in test_load_rowwise_to_colwise
os.makedirs(path)
File "/opt/conda/envs/py_3.8/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/tmp/tmps5rxw4hb'
```
If the temporary directory really needs to be cleaned up, another way would be to remove everything underneath it, but leave the folder alone.
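A minimal sketch of that alternative, with a hypothetical helper name (illustrative only, not necessarily the change made in this PR):
```python
import os
import shutil
import tempfile

def clear_dir_contents(dirpath: str) -> None:
    # Remove everything underneath the directory but keep the directory itself,
    # avoiding the rmtree-then-makedirs race described above.
    for entry in os.scandir(dirpath):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)

path = tempfile.mkdtemp()
open(os.path.join(path, "dummy.txt"), "w").close()
clear_dir_contents(path)
print(os.listdir(path))  # [] -- the directory itself still exists
```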
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93302
Approved by: https://github.com/kumpera
**Summary**
For onednn quant backend only.
QConv weight may be reordered to another blocked format if the input shape changes at runtime. There was a bug where group info was not retained for such reordering, which may lead to a wrong weight shape after reordering. This PR fixes that bug.
**Test plan**
python test/test_quantization.py -k test_conv_reorder_issue_onednn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91934
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Since PR https://github.com/pytorch/pytorch/pull/58691, replacing the second input of `Gather` from 0 to 1 affects other innocent nodes. In issue #91526, onnx::range starts from 0, and that 0 is changed by this mechanism because it is shared with onnx::Gather. This PR creates a wholly independent Constant 0 for the replacement. NOTE: the PR passes all existing RNN tests locally, in case CI doesn't include RNN tests.
~~TODO: test~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93120
Approved by: https://github.com/BowenBao
This PR changes the op registration to a better mechanism: we now
require registering the OpOverload directly instead of the op
key string. This has several benefits:
1. We ensure that the registration targets the correct op, which
means it would fail if the op registration becomes wrong (this PR
already fixes several op registration errors as we switch to direct
OpOverload registration).
2. If the overload name gets changed or deleted, we immediately know at
source-code compilation level, which is safer.
3. This also keeps it consistent with the op registration mechanism of
other tensor subclasses within PyTorch.
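A minimal sketch of the idea (the `_registry`/`add_handler` names are hypothetical, not DTensor's actual registration API):
```python
import torch

def add_handler(*args, **kwargs):
    # Placeholder handler; real handlers would implement sharding propagation etc.
    return torch.ops.aten.add.Tensor(*args, **kwargs)

# Keying the table by the OpOverload object itself (rather than a schema string)
# means a renamed or removed overload fails immediately at registration time.
_registry = {torch.ops.aten.add.Tensor: add_handler}

out = _registry[torch.ops.aten.add.Tensor](torch.ones(2), torch.ones(2))
print(out)  # tensor([2., 2.])
```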
Differential Revision: [D42876250](https://our.internmc.facebook.com/intern/diff/D42876250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90735
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused ConvAddReLU2d module for the onednn backend, which will be used for int8 inference with the onednn backend. Calling this module with other quantization backends throws an error.
**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add_relu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91154
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `ConvAdd2d` module for the onednn backend, which will be used for int8 inference with the onednn backend. Calling this module with other quantization backends throws an error.
**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91152
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Avoid dereferencing element [0] if the vector is empty.
___
In ```transferInputOutputBackends```, one of the rewrite passes for Vulkan ```optimize_for_mobile```, an out of bounds access happens when trying to insert a backend transfer for an input if that input's ```uses()``` is empty. This diff corrects that issue.
Test Plan:
Run tests
___
Phabricator + CI Tests
Reviewed By: SS-JIA
Differential Revision: D41296037
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92918
Approved by: https://github.com/SS-JIA, https://github.com/kirklandsign
The previous sentence seemed to imply that sparse may not always be helpful (i.e., your execution time may increase when using sparse), but the docs stated otherwise.
A simple re-ordering of two words in the documentation to better align with the intended meaning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93258
Approved by: https://github.com/cpuhrsch
Like #89924 and #91083, #85097 added new extra dependencies on nvidia-*. They are Linux x86_64 (GPU) only packages, but were not marked as such, causing issues installing pytorch 1.13 via Poetry (and possibly other tools that follow PyPI's metadata API) on Linux aarch64 systems. This "fixes" the issue by adding the `and platform_machine == 'x86_64'` marker on these dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93066
Approved by: https://github.com/malfet
Previously, Dynamo faked support for item() when `capture_scalar_outputs` was True by representing it internally as a Tensor. With dynamic shapes, this is no longer necessary; we can represent it directly as a SymInt/SymFloat. Do so. Doing this requires you to use dynamic shapes; in principle we could support scalar outputs WITHOUT dynamic shapes but I won't do this unless someone hollers for it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Differential Revision: [D42885775](https://our.internmc.facebook.com/intern/diff/D42885775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93150
Approved by: https://github.com/voznesenskym
This PR is almost a no-op, as most of the logic resides in the builder repo, namely:
6342242c508f361d91e1
Remove the `conda-forge` channel dependency for the test job, but add the `malfet` channel for 3.11 testing (as numpy is not in the default channel yet).
Build and upload the following dependencies to the `pytorch-nightly` channel:
```
anaconda copy --to-owner pytorch-nightly malfet/numpy/1.23.5
anaconda copy --to-owner pytorch-nightly malfet/numpy-base/1.23.5
anaconda copy --to-owner pytorch-nightly malfet/mkl-service/2.4.0
anaconda copy --to-owner pytorch-nightly malfet/mkl_random/1.2.2
anaconda copy --to-owner pytorch-nightly malfet/mkl_fft/1.3.1
anaconda copy --to-owner pytorch-nightly malfet/sympy/1.11.1
anaconda copy --to-owner pytorch-nightly malfet/mpmath/1.2.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93186
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
Summary: We originally cleared the cache of the converter to avoid memory leaks; now that the cache uses a weak map this is no longer necessary. Clearing of the cache caused an error in an interaction with the minifier because the minifier uses delayed compilation, so the cleanup had occurred before inductor was invoked.
Test Plan: Memory regression is being checked via dashboard and on master.
Differential Revision: D42858624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93304
Approved by: https://github.com/ezyang
Every now and then, the python docs push will fail because the base branch (pytorchbot/base) is too old and accumulates commits that might cause the cla check to fail. Pushing to the base branch will prevent it from being old.
The site branch cannot be used because the following push to site will cause the pr to be closed, preventing us from getting the cla check the next day, which is what happened to https://github.com/pytorch/pytorch.github.io/pull/1157 when I was trying to figure this out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93305
Approved by: https://github.com/huydhn
In file combinatorics.py, the comparison of the Collection length creates a logical short circuit:
if isinstance(self.sampler, Sized) and len(self.sampler) >= 0:
Here, the right side of the conjunction will always return true.
I suggest that the Collection length check be removed, since it is redundant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93025
Approved by: https://github.com/albanD
We'll rely on the underlying fake tensor to raise an error in these cases. We only raise the error if there is an input to the data dependent operation that is a real tensor (and thus we are at risk of accidentally burning in real values)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93265
Approved by: https://github.com/albanD
Periodic debug builds are actually running against Python 3.10.
Remove the Python version specifier from libtorch builds, as it is kind of
irrelevant (libtorch is a C++-only build, so the Python version should not
matter).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93327
Approved by: https://github.com/kit1980
This must wait for the forward compatibility period since it requires the
`cuda::_exchange_device` primitive for TorchScript. Also since TorchScript
doesn't support inheritance, we can't just inherit from `_DeviceGuard` here.
This saves around 2 us per `with` statement.
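For reference, the `with` statement in question is presumably the `torch.cuda.device` context manager; a minimal sketch (guarded so it only does anything on a multi-GPU machine):
```python
import torch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    with torch.cuda.device(1):
        x = torch.empty(8, device="cuda")  # allocated on cuda:1
    print(x.device)  # cuda:1
```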
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91127
Approved by: https://github.com/ngimel
**Summary**
Previously, we used `DNNL_RUNTIME_S32_VAL` as the `zero point` for `src` in both weight prepack and convolution forward to ensure the same weight block format is used. The problem is that `DNNL_RUNTIME_S32_VAL` may query a different weight block format compared with the true `zero point` for `src`, which pushes oneDNN convolution onto the `jit` path instead of the `brgconv` path. Here we use the true `zero point` for `src` to create the primitive descriptor, and reorder the weight if its block format differs from the one that weight prepack generated.
**Test Plan**
```
python -m pytest quantization/core/test_quantized_op.py::TestQuantizedConv::test_conv_transpose_reorder_issue_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90818
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5, https://github.com/jerryzh168
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused conv2d_add_relu op for the onednn backend, which will be used for int8 inference with the onednn backend. Calling this op with other quantization backends throws an error.
**Test Plan**
```
python -m pytest test_quantization.py::TestQuantizedConv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90364
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
This adds the numpy typing plugin to the mypy config so that we can
use it for DeviceMesh typing annotations.
Please see https://github.com/pytorch/pytorch/pull/92931 for why we need this. For example, we are currently saving the DeviceMesh's mesh field as a torch.Tensor, so when we do something like:
```python
with FakeTensorMode():
device_mesh = DeviceMesh("cuda", torch.arange(4))
```
It would throw an error because FakeTensorMode (or any TorchDispatchMode) tracks every tensor creation and interaction, while DeviceMesh just wants to save an nd-array to record the mesh topology and would like to avoid interacting with subsystems like FakeTensor. So we want to support saving `mesh` as a numpy array instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92930
Approved by: https://github.com/ezyang, https://github.com/malfet
Summary:
D38543798
Enabled Memopt previously to fix a bug with memory planner
Mirroring the changes we made Internally to OSS
Test Plan: OSS CI
Reviewed By: digantdesai
Differential Revision: D42782958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93097
Approved by: https://github.com/digantdesai
Summary: For backend/PG plugin, use `ProcessGroup.BackendType.CUSTOM` to avoid uninitialized variable during `pg._register_backend` later
Test Plan: CI/CD and internal tests
Differential Revision: D42793222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize `out` to the appropriate size. Q: Why are these handwritten instead of codegen'd anyway? Q2: Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels?
Test Plan: ci.
Differential Revision: D42177051
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194
Approved by: https://github.com/ezyang
Summary:
Changes node.meta["target_dtype_info"] to store observer/fake_quant constructors instead of (dtype, is_dynamic),
so that in the future users can configure this themselves. Follow-up refactors:
(1). Generalize the structure of "target_dtype_info": right now we have "input_act_obs_or_fq_ctr", "weight_obs_or_fq_ctr", "bias_obs_or_fq_ctr", "output_obs_or_fq_ctr".
This works OK for current use cases, and users use a separate config to specify which input is weight and which input is bias. To generalize it,
we should just expose an API that allows users to specify either a dictionary from input_index to obs_or_fq_ctr and from output_index to obs_or_fq_ctr,
e.g. for out1, (out2, out3) = op(arg0, (arg1, arg2)):
"input_act_obs_or_fq_ctr" = {0: obs1, 1: obs2}
"output_act_obs_or_fq_ctr" = {0: obs3, 1: obs4}
(note that this would not allow configuring obs/fq for nested structures),
or a config that mimics the structure of the arguments and outputs, e.g. for out1, (out2, out3) = op(arg0, (arg1, arg2)):
"input_act_obs_or_fq_ctr" = (obs1, (obs2, obs3))
"output_act_obs_or_fq_ctr" = (obs4, (obs5, obs6))
(2). Use these observers/fake_quants directly for inserting observers instead of using qconfig.
(3). Clean up the TODOs in the code base.
Test Plan:
python test/test_quantization.py TestQuantizeFx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92641
Approved by: https://github.com/jcaip
Fixes #ISSUE_NUMBER
Currently they are failing due to things like:
```
ERROR: An error occurred during the fetch of repository 'tf_runtime':
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/xla/third_party/tensorflow/third_party/repo.bzl", line 73, column 33, in _tf_http_archive_impl
ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [3367783466.tar.gz, 3367783466.tar.gz] to /home/jenkins/.cache/bazel/_bazel_jenkins/b463291cb8b07b4bfde1e3a43733cd1a/external/tf_runtime/temp17509854002229755553/3367783466dff91b8b283d61c7fe8abc9e7bbb80.tar.gz: Checksum was 4d2fc38d8b6edd1a478ea2fcb88491eeaf7378e5ffe9f4e3eb3b821df1d1c5ba but wanted 5e6bab71ce31b4b56105ac4567f8bffa5f5b3de7ad3064638297249e69375623
```
so I'm moving them to unstable until we investigate and fix the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93296
Approved by: https://github.com/huydhn
This makes it so that ONLY when the user doesn't set anything for foreach or fused do we switch the default, cascading Adam so that we default to fused, then foreach, then single-tensor.
To clarify:
* if the user puts True in foreach _only_, it will run the foreach implementation.
* if the user puts True in fused _only_, it will run the fused implementation.
* if the user puts True in foreach AND for fused, it will run the fused implementation.
And:
* if the user puts False in foreach _only_, it will run the single tensor implementation.
* if the user puts False in fused _only_, it will still run the single tensor implementation.
* if the user puts False in foreach AND for fused, it will run the single tensor implementation.
I also didn't trust myself that much with the helper function, so I ran some local asserts on _default_to_fused_or_foreach. The only point left to really test is the type(p) check (that p is a plain torch.Tensor), but I think the distributed tests will catch that in CI.
```
cuda_only_fp_list = [
torch.rand((1, 2), device="cuda", dtype=torch.float32),
torch.rand((1, 2), device="cuda", dtype=torch.float64),
torch.rand((1, 2), device="cuda", dtype=torch.float16),
torch.rand((1, 2), device="cuda", dtype=torch.bfloat16),
]
cuda_only_int_list = [
torch.randint(1024, (1, 2), device="cuda", dtype=torch.int64),
]
cpu_list = [
torch.rand((1, 2), device="cpu", dtype=torch.float32),
torch.rand((1, 2), device="cpu", dtype=torch.float64),
torch.rand((1, 2), device="cpu", dtype=torch.float16),
]
none_list = [None]
# differentiable should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, False) == (False, False)
# cpu lists should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, False) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, False) == (False, False)
# has fused triggers correctly
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, False) == (False, True)
# ints always goes to foreach
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, False) == (False, True)
# Nones don't error
assert _default_to_fused_or_foreach([cuda_only_fp_list, none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list, none_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([none_list], False, False) == (False, True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93184
Approved by: https://github.com/albanD
We want to make TorchRec sharded models TorchScriptable.
TorchRec sharded models use the generic types Awaitable[W] and LazyAwaitable[W] (https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/types.py#L212).
In a sharded model those types are used instead of the contained type W, holding an initialization function that produces an object of type W.
At the moment the first attribute of W is requested, `LazyAwaitable[W]` calls its initialization function (on the same stack), caches the result inside, and works transparently as an object of W. So we can think of it as delayed object initialization.
To support this behavior in TorchScript, we propose a new TorchScript type: `Await`.
In eager mode it works the same as `LazyAwaitable[W]` in TorchRec, being dynamically typed: it acts as a type `W` while it is `Await[W]`.
Within TorchScript it is `Await[W]` and can only be explicitly converted to W, using the special function `torch.jit._awaitable_wait(aw)`.
Creation of this `Await[W]` is done via another special function, `torch.jit._awaitable(func, *args)`.
The semantics are close to `torch.jit.Future`, fork, and wait, and use the same jit mechanics (inline fork Closures), with the difference that the function is not started in parallel on fork. It is only stored as a lambda inside the IValue and will be called on the same thread when `torch.jit._awaitable_wait` is called.
For example (more examples in this PR `test/jit/test_await.py`)
```
def delayed(z: int) -> int:
    return z * 3

@torch.jit.script
def fn(x: Tensor):
    aw: Await[int] = torch.jit._awaitable(delayed, 99)
    a = torch.eye(2)
    b = torch.jit._awaitable_wait(aw)
    return a + b + x
```
Functions semantics:
`_awaitable(func -> Callable[Tuple[...], W], *args, **kwargs) -> Await[W]`
Creates an Await object and owns args and kwargs. Once `_awaitable_wait` is called, it executes the function func and owns the result. Subsequent `_awaitable_wait` calls return the result from that first call.
`_awaitable_wait(Await[W]) -> W`
Returns the cached result of type W if this is not the first `_awaitable_wait` call on this Await object, or calls the stored function if it is the first.
`_awaitable_nowait(W) -> Await[W]`
Creates a trivial Await[W] wrapper around the specified object, to be type compliant for the corner cases.
Differential Revision: [D42502706](https://our.internmc.facebook.com/intern/diff/D42502706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90863
Approved by: https://github.com/davidberard98
We would handle py::error_already_set correctly from pybind11 bindings,
but not from our regular TH bindings, which meant that anything from
an inner pybind11 function call was getting unconditionally transformed
into a RuntimeError. Not too many cases where we do this, but
PySymNodeImpl was one of them.
To test this, I need to raise a non-RuntimeError from a function which
is invoked from pybind11 and then propagated to a non-pybind11 call
site. I introduce GuardOnDataDependentSymNode for expressly this
purpose (this is how I discovered the bug anyway.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93238
Approved by: https://github.com/Skylion007, https://github.com/albanD
Not only is this change usually shorter and more readable, it can also yield better performance. size() is not always a constant-time operation (such as on linked lists), but empty() always is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
Scalar is a union type of [int, float, bool]; it's only needed for representing the operator schema.
During export, we always have the concrete argument. As ex.Argument is already a union type, we don't need the Scalar type anymore.
Example
Here's the schema for aten.add.Scalar
```
add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> Tensor
```
A fx.node
```
add_tensor: f32[s0, s0] = torch.ops.aten.add.Scalar(arg0, 1.1)
```
would be exported as
```
Node(
op='call_function',
target='aten.add.Tensor',
args=[
Argument(as_tensor=TensorArgument(name='arg0')),
Argument(as_float=1.1)
],
outputs=[
ReturnArgument(as_tensor=TensorArgument(name='add_tensor'))
]
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93211
Approved by: https://github.com/suo
- Node can only be 'call_function' ops
- 'placeholder' and 'output' are serialized as inputs and outputs of the Graph
- 'get_attr' is not needed anymore, as it's an implicit lookup from GraphModule's parameters/buffers
- 'call_method' and 'call_module' are not supported, as they are not used in the canonical FX Graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93208
Approved by: https://github.com/suo, https://github.com/Neilblaze
Fixes#92831
This PR fixes a test failure of `TestTorch.test_from_buffer` on a big-endian machine. The root cause of this failure is that the current `THPStorage_fromBuffer` does not perform endian handling correctly on a big-endian machine.
In `THPStorage_fromBuffer`, the given buffer is stored as machine native-endian. Thus, if the specified byte order (e.g. `big`) is equal to machine native-endian, swapping elements should not be performed. However, in the current implementation, [`decode*BE()`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/byte_order.cpp#L72-L109) always swaps elements regardless of machine native-endian (i.e. these methods assume buffer is stored as little-endian).
Thus, this PR uses the following approaches:
- if the specified byte order (e.g. `big`) is equal to machine native-endian, call `decode*LE()` that does not swap elements by passing `torch::utils::THP_LITTLE_ENDIAN` to `THP_decode*Buffer()`.
- if the specified byte order (e.g. `big`) is not equal to machine native-endian, call `decode*BE()` that always swap elements by passing `torch::utils::THP_BIG_ENDIAN` to `THP_decode*Buffer()`.
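For reference, a small sketch of the expected behavior (independent of the host's endianness) that the test below exercises; values worked out by hand:
```python
import torch

raw = bytearray([1, 2, 3, 4])
# Interpreted as big-endian int16 values: 0x0102 = 258, 0x0304 = 772
print(torch.ShortStorage.from_buffer(raw, "big").tolist())     # [258, 772]
# Interpreted as little-endian int16 values: 0x0201 = 513, 0x0403 = 1027
print(torch.ShortStorage.from_buffer(raw, "little").tolist())  # [513, 1027]
```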
After applying this PR to the master branch, I confirmed that the test passes on a big-endian machine.
```
% python test/test_torch.py TestTorch.test_from_buffer
/home/ishizaki/PyTorch/master/test/test_torch.py:6367: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
self.assertEqual(torch.ByteStorage.from_buffer(a).tolist(), [1, 2, 3, 4])
...
/home/ishizaki/PyTorch/master/test/test_torch.py:6396: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
self.assertEqual(bytes.tolist(), [1, 2, 3, 4])
.
----------------------------------------------------------------------
Ran 1 test in 0.021s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92834
Approved by: https://github.com/ezyang
This goes together with https://github.com/pytorch/test-infra/pull/1548 to clean up MacOS M1 runner after the workflow finishes. I'm referring to my test branch here to test https://github.com/pytorch/test-infra/pull/1548. Once that PR is merged, I will switch to the main branch, i.e. `pytorch/test-infra/.github/actions/setup-miniconda@main` and `pytorch/test-infra/.github/actions/check-disk-space@main`
In the future, if there are more steps that need to be done after the MacOS workflow finishes, this can also be refactored into a separate action like `teardown-linux`. There is only one step at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93126
Approved by: https://github.com/ZainRizvi
This optimizes an edge case where some compute-only ops (e.g. add)
could end up in an orphan graph at the input side due to the bucket
for the next graph being full already. The fix is to fuse this
graph (which is "empty" in parameter count) together with the adjoining
"full" bucket.
Note: I encountered this when trying to repro some suspected duplicate
argument errors, but this is unrelated and I have not yet repro'd
a duplicate arg issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93162
Approved by: https://github.com/davidberard98
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `conv2d_add` op for the onednn backend, which will be used for int8 inference with the onednn backend. Calling this op with other quantization backends raises an error.
**Test Plan**
```
python -m pytest test_quantization.py::TestQuantizedConv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90262
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Exponential distribution is continuous. Fixes CPU MKL exponential implementation to exclude integer dtypes.
```python
import torch
dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64]
for dtype in dtypes:
    x = torch.empty(10000, dtype=dtype).exponential_()  # should fail!
    print("dtype: ", x.dtype, "sum: ", x.sum())
```
### Additional Context
Related to #92709. This issue propagates to OpInfo of exponential.
```
AssertionError: The supported dtypes for exponential on device type cpu are incorrect!
The following dtypes worked in forward but are not listed by the OpInfo: {torch.int64, torch.uint8, torch.int8, torch.int16, torch.int32}.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92891
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/ngimel
Rely on CI.
Avoid issues such as:
```
Traceback (most recent call last):
File "<string>", line 38, in <module>
File "<string>", line 36, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 690, in <module>
main()
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 626, in main
parsed_yaml, custom_ops_parsed_yaml = parse_yaml_files(
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 505, in parse_yaml_files
translate_native_yaml(
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 448, in translate_native_yaml
for e in native_es:
TypeError: 'NoneType' object is not iterable
```
Differential Revision: [D42729435](https://our.internmc.facebook.com/intern/diff/D42729435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92938
Approved by: https://github.com/JacobSzwejbka
Summary:
This is no longer needed, we can use dtype to decide whether an observer is needed or not
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92589
Approved by: https://github.com/jcaip
For some tensor x, x.type(torch.FloatTensor) will essentially do the same thing as x.to(torch.float). x.type can be called with at least 3 types of inputs:
* a string "torch.FloatTensor"
* a dtype torch.float
* a tensor type torch.FloatTensor
The third option (torch.FloatTensor) fails in fx, because fx cannot trace torch.FloatTensor objects. So this PR replaces the torch.FloatTensor type with the string "torch.FloatTensor".
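A small sketch of the three call styles (plain eager code, not the traced graph):
```python
import torch

x = torch.arange(4, dtype=torch.int64)
a = x.type("torch.FloatTensor")  # string form: traceable by fx
b = x.type(torch.float)          # dtype form: traceable by fx
c = x.type(torch.FloatTensor)    # tensor-type form: fx cannot trace this object
assert a.dtype == b.dtype == c.dtype == torch.float32
```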
Why not fix this in fx? Well, it's possible, but I'm not sure a nice way to do it. We would want to update [torch.fx.node.BaseArgumentTypes](d88bc38b0c/torch/fx/node.py (L17)) to contain torch.FloatTensor etc. We could hard-code a list of tensor types there (the types vary depending on build type, e.g. whether or not cuda tensors are available), but that's not great in case our hardcoded list differs from the actual list registered by python_tensor.cpp. Another option is to dynamically populate the list of types with `Union[tuple(...)])`, and fill the tuple with `torch._tensor_classes` (which is directly populated by python_tensor.cpp), but apparently this breaks most typecheckers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93043
Approved by: https://github.com/jansel
Removes this unused var. The overall buffer comm hook feature is also not being used; we should deprecate / remove it, as it is still a private API.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93128
Approved by: https://github.com/awgu
Not super important, but it is nice for the logs because the logs now say "the action timed out" instead of "the action was cancelled". It also makes the job status "failure" instead of "cancelled"
Also adds timeout minutes as an input for ROCm and Mac tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93084
Approved by: https://github.com/huydhn
Summary:
As a follow up in https://github.com/pytorch/pytorch/pull/92664 (D42619405 (e6a8267cf5)), clean up the TRITON_CACHE_DIR settings. There are a few places touching TRITON_CACHE_DIR:
1. triton/fb/triton_util.py: when import triton
2. caffe2/torch/_inductor/codecache.py
3. caffe2/torch/_inductor/triton_ops/autotune.py
4. triton/triton/python/triton/compiler.py
IIUC there are the following entry points:
* kernel.run(args): 1 -> 3 -> 4
* async_compile(kernel): 1 -> 2 -> 3 -> 4
* calling a triton jit-annotated func directly: 4
I'm removing the TRITON_CACHE_DIR in 1 and 2.
Test Plan: Run local repro
Differential Revision: D42694374
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92879
Approved by: https://github.com/jansel
**Summary**
This work continues with https://github.com/pytorch/pytorch/pull/83784 by @vkuzo and includes all the changes in that PR.
Quote from https://github.com/pytorch/pytorch/pull/83784:
> Issue #83658 reports that ops followed by a certain pattern of `view` and `size` ops were not quantized correctly by FX graph mode quantization.
> Before this PR, the "size" op was in the "op shares qparams with input" category, and the code assumed that the input of this op has the same dtype as its output. This led to incorrectly propagating the `int` dtype as the output of whichever op was preceding the `view` op, which in turn made that op blocklisted from quantization.
> The fix is to create a new category of ops which work on different dtypes of tensors but are not observed. This PR does so for `size`, and also for `shape` since it works the same way.
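For orientation, a minimal sketch of the linear-size-view pattern that the tests below exercise (module and shapes are illustrative, not the exact repro from the issue):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.linear(x)
        n = x.size(-1)         # `size` returns a Python int, not a tensor
        return x.view(-1, n)   # `view` consumes that int downstream of the linear
```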
**Note**: This PR needs https://github.com/pytorch/pytorch/pull/91297 to be landed first otherwise there is a UT failure.
**Test plan**
```
python test/test_quantization.py -k test_linear_size_view
python test/test_quantization.py -k test_linear_shape_view
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90001
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary: We are trying to add a new feature for quantized gradient computation which enables backward() function for QNNPACK
Test Plan: buck2 test //caffe2/test/quantization:quantization -- test_qlinear_qnnpack_free_memory_and_unpack
Differential Revision: D40927291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92714
Approved by: https://github.com/digantdesai, https://github.com/jianyuh
Also skip `test_roi_align_dynamic_shapes` for cuda as introduced by https://github.com/pytorch/pytorch/pull/92667. With Torchvision properly installed, the test fails with the following error:
```
2023-01-26T04:46:58.1532060Z test_roi_align_dynamic_shapes_cuda (__main__.CudaTests) ... /var/lib/jenkins/workspace/test/inductor/test_torchinductor.py:266: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
2023-01-26T04:46:58.1532195Z buffer = torch.as_strided(x, (x.storage().size(),), (1,), 0).clone()
2023-01-26T04:46:58.1532383Z test_roi_align_dynamic_shapes_cuda errored - num_retries_left: 3
2023-01-26T04:46:58.1532479Z Traceback (most recent call last):
2023-01-26T04:46:58.1532725Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 1155, in run_node
2023-01-26T04:46:58.1532821Z return node.target(*args, **kwargs)
2023-01-26T04:46:58.1533056Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 499, in __call__
2023-01-26T04:46:58.1533160Z return self._op(*args, **kwargs or {})
2023-01-26T04:46:58.1533304Z RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
```
https://github.com/pytorch/pytorch/issues/93054 reveals a blindspot in the CI where Torchvision was only installed in the first and second shard. The above test should show that failure as part of https://github.com/pytorch/pytorch/pull/92667, but then it was skipped because Torchvision was not installed (in the 3rd shard) for `test_roi_align` to run. The test is still skipped here, but in a more explicit way.
Fixes https://github.com/pytorch/pytorch/issues/93054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93108
Approved by: https://github.com/clee2000, https://github.com/jjsjann123, https://github.com/nkaretnikov
Summary:
One such place where a circular reference can occur: _load_state_dict_pre_hooks contains a _WrappedHook, and the _WrappedHook has a weakref to the same module.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93038
Approved by: https://github.com/jerryzh168
My first attempt to fix `Library not loaded: @rpath/libzstd.1.dylib` issue on MacOS M1 in https://github.com/pytorch/pytorch/pull/91142 provides some additional logs about flaky error but doesn't fix the issue as I see some of them recently, for example
* e4d83d54a6
Looking at the log, I can see that:
* CMAKE_EXEC correctly points to `CMAKE_EXEC=/Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/cmake`
* The library is there under the executable rpath
```
ls -la /Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/../lib
...
2023-01-20T23:22:03.9761370Z -rwxr-xr-x 2 ec2-user staff 737776 Apr 22 2022 libzstd.1.5.2.dylib
2023-01-20T23:22:03.9761630Z lrwxr-xr-x 1 ec2-user staff 19 Jan 20 22:47 libzstd.1.dylib -> libzstd.1.5.2.dylib
...
```
Then calling cmake after that suddenly uses the wrong cmake from miniconda package cache:
```
2023-01-20T23:22:04.0636880Z + cmake ..
2023-01-20T23:22:04.1924790Z dyld[85763]: Library not loaded: @rpath/libzstd.1.dylib
2023-01-20T23:22:04.1925540Z Referenced from: /Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake
```
This is weird, so my second attempt will be more explicit and use the correct cmake executable in `CMAKE_EXEC`. Maybe something manipulates the global PATH in between, making `/Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake` come first in the PATH.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92737
Approved by: https://github.com/ZainRizvi
**Summary**
For onednn quantization backend only.
Currently, FX fusion requires that all separate ops in a fused module/op have the same `qconfig`. To support `linear - leaky_relu` and `linear - tanh` fusion with onednn backend, we previously explicitly set the same `qconfig` to `linear`, `leaky_relu` and `tanh`. However, this brings two problems:
- It breaks fusion of `linear - relu` since `relu` does not have the same `qconfig` as `linear` does. And it does not look good if we set `qconfig` to all these ops. They should use a global `qconfig` by default.
- `Tanh` requires `fixed_qparams_qconfig` otherwise it is not quantized. So, we cannot set another `qconfig` to `tanh`.
Looks like there is not a straightforward way to solve the problems. This PR fixes them by the following:
- Do not set `qconfig` to these ops so that these ops use a global `qconfig` and `linear - relu` and `linear - leaky_relu` can be fused correctly.
- Set the same `qconfig` to `linear` and `tanh` manually by users when they want to fuse `linear - tanh` with onednn backend.
A known issue still exists: users cannot fuse `linear - tanh` and quantize standalone `tanh` at the same time.
**Test plan**
python test/test_quantization.py -k test_qconfig_dict_with_fused_modules
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91297
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
Fixes https://github.com/pytorch/pytorch/issues/92283
The repro now works:
```python
import torch
import torch.func
import torch.nn as nn
x = torch.randn(3, device='cuda')
y = torch.randn(1, 3, device='cuda')
def fn(x, y):
    # previously output of dropout used to be incorrect [B, 3] (B=1) and thus `mean(1)` used to fail
    # post the fix output of dropout is [B, 1, 3] and `mean(1)` works.
    return x + nn.functional.dropout(y, 0.3).mean(1)
o = torch.func.vmap(fn, in_dims=(0, None), randomness='different')(x, y)
```
**NOTE**:
`native_dropout_batching_rule(const Tensor& tensor, double p, c10::optional<bool> train)` was called only for CUDA tensor. Hence this issue only affected CUDA tensors and not CPU tensors
Ref:
a6ac922eab/aten/src/ATen/functorch/PyTorchOperatorHacks.cpp (L251-L258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92975
Approved by: https://github.com/Chillee, https://github.com/Skylion007
This PR is the first step towards refactoring the build for nvfuser in order to make the codegen a standalone library.
Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. splits the build system so nvfuser is generating its own `.so` files. Currently there are:
- `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
- `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser cpp tests are currently compiled into `nvfuser_tests`
4. cmake is refactored so that:
- nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
- nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
- nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
- since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`
Future work that's scoped in following PR:
- Currently since nvfuser codegen has dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled bazel build for nvfuser. This could impact internal workload at Meta, so we need to put support back. cc'ing @vors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
Summary:
Currently, we define some C++ functions in one C++ Python extension
which are used by another. This happens to work, but isn't guaranteed to.
This diff moves these functions to a separate C++ library rule to fix this.
Test Plan: CI
Differential Revision: D42552515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92325
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Summary:
Makes torch.package debugging more transparent by
1. Pointing out modules in the standard library that are not implicitly externed.
2. Creating a debug mode for users to find the source of broken modules.
Test Plan: Run package tests
Differential Revision: D42728753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92939
Approved by: https://github.com/kurman
To make them agnostic of Ubuntu version, ROCm version, and Python minor version.
This should help avoid frequent updates to the docker image tags when upgrading ROCm version in PyTorch CI, which has creation of new ECR tags as a blocking step.
Reference: https://github.com/pytorch/pytorch/pull/88297#issuecomment-1307873280
The BUILD_ENVIRONMENT flag will continue to specify the exact versions for the above, in case it is needed for debug. @malfet @seemethere Hope that's not going away, otherwise we might have a harder time debugging issues where we need to figure out these environment details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90694
Approved by: https://github.com/malfet
Fixes#92043.
I'm following numpy's implementation as suggested by @min-jean-cho.
I found out that this implementation still overflows when working with numbers greater than `finfo.max / 2`, but this is still much better than the previous implementation, which overflowed for numbers greater than `finfo.max ** 0.5`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92539
Approved by: https://github.com/lezcano
Attempts to fix#92656
BC-breaking! This changes the default of zero_grad in optim and in nn to set grads to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (Will probably have to flesh out this note more.)
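A minimal sketch of the behavioral difference (a toy model; the flag is passed explicitly so the snippet does not depend on the default):
```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(1, 4)).sum().backward()
opt.zero_grad(set_to_none=True)    # new default: grads become None
print(model.weight.grad)           # None

model(torch.randn(1, 4)).sum().backward()
opt.zero_grad(set_to_none=False)   # old behavior: grads become zero tensors
print(model.weight.grad)           # tensor of zeros
```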
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #92986
When running compiled submods for the purpose of producing outputs to pass
to the compilation step for the next submod, we use fake parameters and
assume fake inputs, but we forgot to activate our fake_mode during execution.
This caused certain edge cases where tensors other than activations or parameters
got created during execution, such as scalar->tensor expansion in the case
of executing torch.where(tensor, scalar, scalar).
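For reference, the scalar-to-tensor expansion mentioned above is ordinary eager behavior, e.g. (a trivial sketch, not the DDPOptimizer repro):
```python
import torch

cond = torch.tensor([True, False, True])
# Both branch arguments are Python scalars; they are materialized as tensors at execution time.
out = torch.where(cond, 1.0, 0.0)
print(out)  # tensor([1., 0., 1.])
```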
Also add a test and clarify behavior of DDPOptimizer via comments.
Fixes#92941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92986
Approved by: https://github.com/bdhirsh
The unused variable in `fmha_api.cpp` [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp#L313) was causing build failures (internally) due to the `-Wunused-variable` flag being used. For example:
```
[2023-01-24T20:32:00.241-08:00] Stderr: aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp:313:25: error: unused variable 'rng_engine_inputs' [-Werror,-Wunused-variable]
[CONTEXT] [2023-01-24T20:32:00.241-08:00] at::PhiloxCudaState rng_engine_inputs;
[CONTEXT] [2023-01-24T20:32:00.241-08:00] ^
[2023-01-24T21:09:33.507-08:00] Stderr: aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp:313:25: error: unused variable 'rng_engine_inputs' [-Werror,-Wunused-variable]
[CONTEXT] [2023-01-24T21:09:33.507-08:00] at::PhiloxCudaState rng_engine_inputs;
[CONTEXT] [2023-01-24T21:09:33.507-08:00]
```
This PR removes that unused variable. Mirroring this same patch made by @drisspg internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93024
Approved by: https://github.com/drisspg
This is a follow-up to the previous PR https://github.com/pytorch/pytorch/pull/88449 to move the dynamo/TorchXLA bridge from the pytorch repo to the xla repo.
Overall the dynamo/TorchXLA integration has the following four layers of code:
- pybind layer: This is the bottom layer containing various pybind APIs as the foundation. This part resides in the xla repo.
- bridge layer: built upon the pybind layer to implement the trace-once functionality. This layer and its corresponding unit test were previously in the pytorch repo. This PR (and the corresponding xla PR https://github.com/pytorch/xla/pull/4476) moves them to the xla repo.
- dynamo backend registration: this is a thin layer that registers 4 dynamo backends (training/inference/trace_once/trace_everytime). It remains in the pytorch repo.
- benchmark script: the torchbench.py script in dynamo is adapted so it can be used in the dynamo/TorchXLA integration. This one remains in the pytorch repo.
We think the new code organization is cleaner.
I'll wait for the xla PR to land first before trying to merge this one.
Tests
1. run the unit tests moved to the xla repo
2. Test for inference: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --backend=torchxla_trace_once --only resnet18`
3. Test for training: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only resnet18 --collect-outputs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92601
Approved by: https://github.com/wconstab
I noticed that `torch.log1p` is ridiculously slow compared to `torch.log`
on CPU, and looking at the assembly it seems vsLog1p doesn't use any
vector instructions. I saw the same for abs, though AFAICT this is
dead code anyway as `abs` is implemented with `cpu_kernel_vec`.
Locally I see a 14x speedup in `torch.log1p`.
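One way to reproduce the comparison locally (numbers will vary by machine and build; this is a sketch, not the exact benchmark used here):
```python
import torch
from torch.utils.benchmark import Timer

x = torch.rand(1_000_000)
print(Timer("x.log()", globals={"x": x}).blocked_autorange())
print(Timer("x.log1p()", globals={"x": x}).blocked_autorange())
```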
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92113
Approved by: https://github.com/jgong5
This gives some speedups for kernels implemented with `at::vml`:
- Make vml ops serial and use `TensorIterator.for_each` for better parallelism
with discontiguous tensors
- Reduce buffer size for discontiguous data to 8 KiB to increase chance of
fitting in L1d cache, but is still wide enough to utilize AVX-512.
- Avoid a copy if only one of input and output is discontiguous
There is no change for contiguous tensors, but I see significant speedup for
the following benchmarks:
```
import torch
a = torch.randn(2*10**6, device="cpu")
%timeit a.view(100, 20000)[:,::2].sqrt()
%timeit a.view(200, 10000)[::2].sqrt()
```
For discontiguous last dimension I see a 27x speedup and for discontiguous
batch dimension I see an 8x speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91963
Approved by: https://github.com/jgong5
IDs for composite workflows are really strange: both the calling step and the step in the composite workflow need an id, but when they differ, the calling step's id takes precedence.
Should fix test uploading problem
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93001
Approved by: https://github.com/huydhn
When there is an original parameter with 1D shape that is fully assigned to one rank, then its `param.shape == view.shape` in `_use_unsharded_grad_views()`. In that case, we still want to check whether `param.dtype == view.dtype` and bypass as necessary.
The previous PR had an additional `and not self.uses_sharded_strategy` because the unit test did not require the check for sharded strategies, and I was conservatively adding a minimal fix. That only worked by happenstance, because there was no 1D parameter fully assigned to one rank. Including the bias in the linear layer exercises that case, and removing the `and not self.uses_sharded_strategy` is necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92874
Approved by: https://github.com/zhaojuanmao
# Summary
Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required C++17, but we have since updated the PyTorch standard.
This PR:
- Changes invocations of unsigned long to the fixed width integer type
- Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch
- Changes some of the macros used within the mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag /Zc:preprocessor
- Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909
Approved by: https://github.com/cpuhrsch
The issue was first solved in [/pull/91371] for CI/CD, but the main Dockerfile in the repo root still has this issue for people trying to build a custom image manually.
Without the fix, the build fails at installing miniconda:
```
#14 3.802 Preparing transaction: ...working... done
#14 4.087 Executing transaction: ...working... done
#14 5.713 /root/miniconda.sh: 438: /root/miniconda.sh: [[: not found
#14 5.713
#14 5.713 Installing * environment...
#14 5.713
#14 5.714 /root/miniconda.sh: 444: /root/miniconda.sh: [[: not found
#14 6.050
#14 6.050 CondaFileIOError: '/opt/conda/pkgs/envs/*/env.txt'. [Errno 2] No such
file or directory: '/opt/conda/pkgs/envs/*/env.txt'
#14 6.050
```
With the modification, locally tested build successfully with `make -f ./docker.Makefile` as instructed in the README
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92702
Approved by: https://github.com/seemethere, https://github.com/malfet
Add bionic-py3.11-clang9, and move vulkan testing to it. Test only fx and jit for the time being (will add more in followup PRs)
Do not install numba, as it's not yet available for python-3.11.
Change installed mkl version as the one installed before was incompatible with numpy
TODO: Remove `-c malfet` when required packages become available on the default conda channel, namely `numpy`, `setuptools`, `coverage`, `mypy-extensions`, `typing-extensions`, `psutil` and `pyyaml`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92787
Approved by: https://github.com/albanD
In 3.11 bytecode size is not constant, so in order to get from `f_lasti` to the opcode index, one needs to search for the closest offset in the disassembled instructions.
Update `_patch_function` to construct code with all the properties that exist in the 3.11 runtime.
Update `_torchscript_schema_to_signature` to mark the `from` named arg as positional-only, as `from` is a reserved keyword in Python and is checked as such by the `inspect` package in 3.11.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92895
Approved by: https://github.com/albanD
You can easily test this by adding
```
@patch.object(config.triton, "convolution", "triton")
```
to test_convolution1 but it takes a long time to autotune so
I don't want to add it to the unit tests.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92870
Approved by: https://github.com/albanD
Summary:
Regularize mask handling for attn_mask and key_padding_mask
* Update documentation to remove reference to byte masks (which were deprecated long ago)
* Introduce check and warn about deprecation if attn_mask and key_padding_mask types mismatch
* Convert all masks to float before combining
* Combine by adding
Test Plan: sandcastle & github CI
Differential Revision: D42653215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92733
Approved by: https://github.com/ngimel, https://github.com/drisspg
Follow up from: Quansight-Labs/numpy_pytorch_interop#3
This PR adds support for NumPy scalars for `torch.asarray`.
**Before:** treats the scalar as an object that implements the buffer protocol. Thus, interprets the data as the default data type (`float32`)
```python
>>> torch.asarray(numpy.float64(0.5))
tensor([0.0000, 1.7500])
```
**After:** identifies the NumPy scalar, and does the "right" thing. i.e. creates a 0-dimensional tensor from the NumPy array that doesn't share its memory
```python
>>> torch.asarray(numpy.float64(0.5))
tensor(0.5000, dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90914
Approved by: https://github.com/lezcano, https://github.com/mruberry
For the cudagraphs implementation, we would like to reuse objects that are defined in python across the forward and backward. The backward is run in a different thread, so to handle this we add an api for copying over arbitrary python objects in pytorch's thread local state, in the same way that C++ objects are copied over currently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89169
Approved by: https://github.com/albanD
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.
CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
### Background
Early on in this process of integrating the FlashAttention code into core we were speaking with Tri and we came to the conclusion that the main branch of Flash Attention wasn't suitable for integration. We instead went with a [refactored version](https://github.com/HazyResearch/flash-attention/tree/cutlass) that more heavily depended upon cutlass.
That is the current version of FlashAttention in PyTorch. However there are some limitations with that branch.
- No backward support for SDPA
- Not as performant for some large MHA setups.
### Summary
This PR pulls in the latest version of the main branch of [FlashAttention](https://github.com/HazyResearch/flash-attention/tree/main). It does not register the backward for the aten function SDPA_flash_attn. That will be done in a follow up PR.
### Changeset
A few changes were made to the original code for PyTorch.
- Flattened one layer of folder structure. (This is to match the existing FlashAttention-in-core structure.)
- Remove return_softmax param and change mha_fwd signature. Since the SDPA in core public function does not support need_weights we remove this argument.
- Add a lot of `#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530` guards around sections of code that will not compile for architectures at or below sm_52. Most of these blocks of code are half-based asm or _hmul2 operations. An example update:
```cpp
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >=530
float f;
asm volatile("cvt.f32.f16 %0, %1;\n" : "=f"(f) : "h"(h));
return f;
#else
assert(false);
return 0;
#endif
}
```
- Remove any blocksparse functions and files, and comment out utility functions that are used in the blocksparse kernels written for FlashAttention, since we did not pull in those kernels.
- Update gemm_cl in **/gemm.h to:
``` c++
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 750
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;
#else
assert(0);
// THIS IS NOT CORRECT BUT THE ASSERT WILL STOP THIS
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;
// TD [2022-06-02] We don't support Volta (SM70) yet.
#endif
```
### Reasoning:
FlashAttention is only designed to run on GPUs that support sm 7.5 or later. However, PyTorch is generally built and released using `TORCH_CUDA_ARCH_LIST=5.2,..,8.6`. This means that source code must be compilable for these lower archs even if it is not run. But how are we sure that it won't be run? That should be handled by the runtime dispatch mechanism, specifically here: [check_arch](d70ed68162/aten/src/ATen/native/transformers/cuda/sdp_utils.h (L308))
There is however one edge case for building from source:
The user specifies TORCH_CUDA_ARCH_LIST={something less than 7.5} while running on a GPU that is >= 7.5. This will cause the runtime dispatcher to think it is okay to run FlashAttention even though the compiled code is bogus.
I tested this with arch=5.3 on an a100 and get the following result:` RuntimeError: CUDA error: no kernel image is available for execution on the device` coming from torch.rand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91994
Approved by: https://github.com/cpuhrsch
Fixes#88470
I added the "method" keyword in `aten/src/ATen/native/native_functions.yaml` for the function `where` with Scalar Overload.
This way, you can now use `Tensor.where()` with a scalar parameter the same way `torch.where()` can.
I added a test in `test/test_torch.py` as requested.
It uses the `where()` method on a tensor and then checks it has the same results as the `torch.where()` function.
The test is roughly the same as the one provided by the author of the issue.
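For illustration, a minimal sketch of the equivalence being tested (values are made up):
```python
import torch

x = torch.tensor([-1.0, 0.5, 2.0])
cond = x > 0
# Method form with a scalar second argument, equivalent to the functional form
assert torch.equal(x.where(cond, 0.0), torch.where(cond, x, 0.0))
```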
PS: this is the second PR I've made to resolve this issue; the first one was #92747. I had trouble with commit signatures there, so it is closed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92849
Approved by: https://github.com/albanD
# Summary
In preparation for pt 2.0 launch this PR updates SDPA's API and makes the function a nn.funcitonal public function.
## Changes
### API
Previously the function signature was:
`scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)`
Updated signature:
`scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor`
This PR removes the need_attn_weights optional boolean variable and updates the return type to a singular tensor.
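A minimal usage sketch of the updated signature (shapes chosen for illustration):
```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```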
#### Reasoning:
The main goal of this function is to provide an easy interface for users to call into fused attention kernels e.g. (FlashAttention). The fused kernels do not currently support arbitrary attn_mask or dropout but there is a PR to mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated.
The fused kernels save on memory usage by not materializing the weights, and it is unlikely that a fast fused implementation will enable this feature, so we are removing it.
Discussed with folks at FAIR/Xformers and +1 this API change.
#### Make function Public
In preparation for the pt 2.0 launch we make the function public to start to generate user feedback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189
Approved by: https://github.com/cpuhrsch
How the old retains_grad hooks were implemented:
- retains_grad hooks are stored on the autograd_meta, as entries in a vector
- upon registration, a wrapper hook CppFunctionTensorPreHook is created to wrap that vector, and then that wrapper hook is registered to the grad_fn, i.e., by appending it to a vector of retains_grad hooks on the grad_fn
- upon in-place, for the old grad_fn we set the retains_grad hook to nullptr, so that even though the old grad_fn still references the vector, the vector contains a single nullptr. For the new grad_fn, we create a new wrapper hook around the vector (storing the single retains_grad hook) on autograd_meta.
The new retains_grad hook implementation:
- we store std::function by value, and we store it on the grad_fn rather than the autograd_meta
- a single grad_fn can have multiple outputs, so it can potentially hold multiple retains_grad hooks. We use an unordered_map (previously a vector).
- on in-place we remove the hook from the old grad_fn and put it in the new grad_fn (a small implication of this change is that we now need access to both the old grad_fn and the new grad_fn; this isn't a problem)
Other details:
- CppFunctionTensorPreHook took a shared_ptr to vector of std::function. In our new implementation, we add a new wrapper hook CppFunctionSingleTensorPreHook, which takes a single std::function.
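For orientation, the user-facing behavior these hooks implement is unchanged; a minimal usage sketch:
```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
y.retain_grad()    # internally registers a retains_grad hook on y's grad_fn
y.sum().backward()
print(y.grad)      # the gradient of the non-leaf tensor y is kept
```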
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92604
Approved by: https://github.com/albanD
Saw some places we missed some old requirements that are no longer necessary (dataclasses and future). Testing to see if all the CIs still work. We don't need dataclasses anymore now that we are on Python >= 3.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92763
Approved by: https://github.com/ezyang
--diff_main renamed to --diff-branch BRANCH and now works again.
Summary table splits results per branch.
csv output now has a column with the branch name when run in this mode.
Added --progress flag so you can track how many models are going to be run.
Example output:
```
$ python benchmarks/dynamo/torchbench.py --quiet --performance --backend inductor --float16 --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --filter 'alexnet|vgg16' --progress --diff viable/strict
Running model 1/2
batch size: 1024
cuda eval alexnet dynamo_bench_diff_branch 1.251x p=0.00
cuda eval alexnet viable/strict 1.251x p=0.00
Running model 2/2
batch size: 128
cuda eval vgg16 dynamo_bench_diff_branch 1.344x p=0.00
cuda eval vgg16 viable/strict 1.342x p=0.00
Summary for tag=dynamo_bench_diff_branch:
speedup gmean=1.30x mean=1.30x
abs_latency gmean=24.09x mean=25.26x
compilation_latency mean=2.0 seconds
compression_ratio mean=0.9x
Summary for tag=viable/strict:
speedup gmean=1.30x mean=1.30x
abs_latency gmean=24.11x mean=25.29x
compilation_latency mean=0.5 seconds
compression_ratio mean=1.0x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92713
Approved by: https://github.com/jansel
If a state is not associated with any parameter, `FSDP.optim_state_dict` should still save it. The current implementation to determine whether a state is associated with a parameter is not completely correct and can cause `use_orig_params=True` to have extra states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92744
Approved by: https://github.com/awgu
This reverts commit 59071ab1e71891d480ab77af0d619bc5e01094c2.
It breaks `quantization.jit.test_ondevice_quantization.TestOnDeviceDynamicPTQFinalize`, which is not run in OSS, but is mandatory for internal CI.
We now pass `fully_sharded_module`, not `root_module`, after the recent refactoring to unify composable and wrapper FSDP. This PR removes the comment explaining why we previously passed in `root_module`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92739
Approved by: https://github.com/mrshenli
Fixes#92808
This PR fixes SIGSEGV on a big-endian machine when reading pickle data.
The root cause is that `size`, which is read from a file, is not converted from little-endian to big-endian before it is used in the method. The fix is to convert `size` on a big-endian machine instead of `nbytes`.
I confirmed that the program in the issue works w/o SIGSEGV and the test passes, with this fix in master branch.
```
$ python test/test_autograd.py TestAutograd.test_pickle
.
----------------------------------------------------------------------
Ran 1 test in 0.010s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92810
Approved by: https://github.com/malfet
Replace cpp string comparisons with more efficient equality operators. The equality operators are not just more readable; they also allow short-circuiting for faster string equality checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92765
Approved by: https://github.com/ezyang
Apply clang-tidy readability-data-pointer fixits. This essentially uses the data() method when possible instead of the less readable `&vec[0]` to get the address of the underlying backing implementation. Not only is this more readable, it is safer as it allows you to retrieve the pointer even when the std::vector or std::string is empty without throwing an index error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92755
Approved by: https://github.com/ezyang
Since the CI exclusions are hard-coded in our script, we might as well require them to match exactly. This solved some head scratching where I was like, "this model is not obviously excluded, why is it not showing up in CI."
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92761
Approved by: https://github.com/jansel
When integrating AOT logging with TorchInductor trace, the ability to print graphs to the console if the user specified any of the env vars was removed (in favor of using TORCH_COMPILE_DEBUG). This restores this by checking if the user set any of the aot debug variables *before* setting up the remainder of the logging, and adding a stream to stdout if any of those env vars are set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92720
Approved by: https://github.com/Chillee
This changes TensorImpl to store SymBool instead of bool. However, it doesn't actually compute these quantities symbolically (outside of some top level disjunctions.) The purpose of this PR is to make it easier to diagnose performance problems in the next PR, as after this change we can switch to guardless implementations without modifying TensorImpl.h
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92229
Approved by: https://github.com/Skylion007, https://github.com/albanD
Summary:
It looks we have some race in the cache directory for triton codegen, when we have multiple processes on the same host:
1. Rank A and B cannot find the code in cache (/tmp/uid/triton/cache) and start compilation separately
2. Most of the time the codegen is the same, but rarely it may produce different llir and different shared memory (in our case it's 544 and 2560, both are valid for the llir/ptx generated). See repro D42584580
3. They both write the compiled so and metadata into the local cache folder, with the same directory name (same hash, without considering device id). There will be a race here even if they grab the file lock, because it only locks each file but not the entire transaction
4. We then load the .so and metadata back from the files. What can happen is that we load the .so from rank A and the shared-memory metadata from rank B, and they mismatch.
Test Plan:
Run the faulty program to double check
```
[trainer5]: cache dir: /tmp/root/4951/triton/cache/198ef4405d2e525acd20d5c2d01ad099
[trainer1]: cache dir: /tmp/root/4947/triton/cache/198ef4405d2e525acd20d5c2d01ad099
```
Differential Revision: D42619405
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92664
Approved by: https://github.com/bertmaher, https://github.com/ngimel, https://github.com/jansel
We have known for a while that we should in principle support SymBool as a separate concept from SymInt and SymFloat ( in particular, every distinct numeric type should get its own API). However, recent work with unbacked SymInts in, e.g., https://github.com/pytorch/pytorch/pull/90985 have made this a priority to implement. The essential problem is that our logic for computing the contiguity of tensors performs branches on the passed in input sizes, and this causes us to require guards when constructing tensors from unbacked SymInts. Morally, this should not be a big deal because, we only really care about the regular (non-channels-last) contiguity of the tensor, which should be guaranteed since most people aren't calling `empty_strided` on the tensor, however, because we store a bool (not a SymBool, prior to this PR it doesn't exist) on TensorImpl, we are forced to *immediately* compute these values, even if the value ends up not being used at all. In particular, even when a user allocates a contiguous tensor, we still must compute channels-last contiguity (as some contiguous tensors are also channels-last contiguous, but others are not.)
This PR implements SymBool, and makes TensorImpl use SymBool to store the contiguity information in ExtraMeta. There are a number of knock on effects, which I now discuss below.
* I introduce a new C++ type SymBool, analogous to SymInt and SymFloat. This type supports logical and, logical or and logical negation. I support the bitwise operations on this class (but not the conventional logic operators) to make it clear that logical operations on SymBool are NOT short-circuiting. I also, for now, do NOT support implicit conversion of SymBool to bool (creating a guard in this case). This does matter too much in practice, as in this PR I did not modify the equality operations (e.g., `==` on SymInt) to return SymBool, so all preexisting implicit guards did not need to be changed. I also introduced symbolic comparison functions `sym_eq`, etc. on SymInt to make it possible to create SymBool. The current implementation of comparison functions makes it unfortunately easy to accidentally introduce guards when you do not mean to (as both `s0 == s1` and `s0.sym_eq(s1)` are valid spellings of equality operation); in the short term, I intend to prevent excess guarding in this situation by unit testing; in the long term making the equality operators return SymBool is probably the correct fix.
* ~~I modify TensorImpl to store SymBool for the `is_contiguous` fields and friends on `ExtraMeta`. In practice, this essentially meant reverting most of the changes from https://github.com/pytorch/pytorch/pull/85936 . In particular, the fields on ExtraMeta are no longer strongly typed; at the time I was particularly concerned about the giant lambda I was using as the setter getting a desynchronized argument order, but now that I have individual setters for each field the only "big list" of boolean arguments is in the constructor of ExtraMeta, which seems like an acceptable risk. The semantics of TensorImpl are now that we guard only when you actually attempt to access the contiguity of the tensor via, e.g., `is_contiguous`. By in large, the contiguity calculation in the implementations now needs to be duplicated (as the boolean version can short circuit, but the SymBool version cannot); you should carefully review the duplicate new implementations. I typically use the `identity` template to disambiguate which version of the function I need, and rely on overloading to allow for implementation sharing. The changes to the `compute_` functions are particularly interesting; for most of the functions, I preserved their original non-symbolic implementation, and then introduce a new symbolic implementation that is branch-less (making use of our new SymBool operations). However, `compute_non_overlapping_and_dense` is special, see next bullet.~~ This appears to cause performance problems, so I am leaving this to an update PR.
* (Update: the Python side pieces for this are still in this PR, but they are not wired up until later PRs.) While the contiguity calculations are relatively easy to write in a branch-free way, `compute_non_overlapping_and_dense` is not: it involves a sort on the strides. While in principle we can still make it go through by using a data oblivious sorting network, this seems like too much complication for a field that is likely never used (because typically, it will be obvious that a tensor is non overlapping and dense, because the tensor is contiguous.) So we take a different approach: instead of trying to trace through the logic computation of non-overlapping and dense, we instead introduce a new opaque operator IsNonOverlappingAndDenseIndicator which represents all of the compute that would have been done here. This function returns an integer 0 if `is_non_overlapping_and_dense` would have returned `False`, and an integer 1 otherwise, for technical reasons (Sympy does not easily allow defining custom functions that return booleans). The function itself only knows how to evaluate itself if all of its arguments are integers; otherwise it is left unevaluated. This means we can always guard on it (as `size_hint` will always be able to evaluate through it), but otherwise its insides are left a black box. We typically do NOT expect this custom function to show up in actual boolean expressions, because we will typically shortcut it due to the tensor being contiguous. It's possible we should apply this treatment to all of the other `compute_` operations, more investigation necessary. As a technical note, because this operator takes a pair of a list of SymInts, we need to support converting `ArrayRef<SymNode>` to Python, and I also unpack the pair of lists into a single list because I don't know if Sympy operations can actually validly take lists of Sympy expressions as inputs. See for example `_make_node_sizes_strides`
* On the Python side, we also introduce a SymBool class, and update SymNode to track bool as a valid pytype. There is some subtlety here: bool is a subclass of int, so one has to be careful about `isinstance` checks (in fact, in most cases I replaced `isinstance(x, int)` with `type(x) is int` for expressly this reason.) Additionally, unlike, C++, I do NOT define bitwise inverse on SymBool, because it does not do the correct thing when run on booleans, e.g., `~True` is `-2`. (For that matter, they don't do the right thing in C++ either, but at least in principle the compiler can warn you about it with `-Wbool-operation`, and so the rule is simple in C++; only use logical operations if the types are statically known to be SymBool). Alas, logical negation is not overrideable, so we have to introduce `sym_not` which must be used in place of `not` whenever a SymBool can turn up. To avoid confusion with `__not__` which may imply that `operators.__not__` might be acceptable to use (it isn't), our magic method is called `__sym_not__`. The other bitwise operators `&` and `|` do the right thing with booleans and are acceptable to use.
* There is some annoyance working with booleans in Sympy. Unlike int and float, booleans live in their own algebra and they support fewer operations than regular numbers. In particular, `sympy.expand` does not work on them. To get around this, I introduce `safe_expand` which only calls expand on operations which are known to be expandable.
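A quick plain-Python illustration of the `bool` pitfalls noted above (nothing PyTorch-specific is needed):
```python
# bool is a subclass of int, so bitwise inversion falls back to integer semantics.
print(~True)                                     # -2, not False
# `&` and `|` do behave correctly on booleans.
print(True & False, True | False)                # False True
# Why `type(x) is int` is preferred over `isinstance(x, int)` in these checks:
print(isinstance(True, int), type(True) is int)  # True False
```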
TODO: this PR appears to greatly regress performance of symbolic reasoning. In particular, `python test/functorch/test_aotdispatch.py -k max_pool2d` performs really poorly with these changes. Need to investigate.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92149
Approved by: https://github.com/albanD, https://github.com/Skylion007
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214
Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.
Interestingly, the use-case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some inputs are more than 2-byte aligned and some are exactly 2-byte aligned is not. This behavior suggests that the `cuBlasLt` heuristics are incorrect, as the heuristic function has visibility of the raw pointer values via the descriptors when it is called.
We will follow up with `cuBlasLt` but this fix is needed to prevent unnecessary crashes for now.
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel
Summary:
This is in preparation for the quantize_pt2e API, where we allow programmability for users to set how
they want to quantize their model
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92574
Approved by: https://github.com/jcaip
Summary:
Easy fix on formatting. For example, the error message currently contains un-interpolated placeholders:
> BackendCompilerFailed: compile_fx raised RuntimeError: Sizes of tensors must match except in dimension 0. Expected {common_length} but got {length} for tensor number {tensor_idx} in the list
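A minimal sketch of the kind of fix implied here, assuming the placeholders were simply never interpolated (values are illustrative):
```python
common_length, length, tensor_idx = 4, 3, 2  # illustrative values
# Before: the braces end up verbatim in the error text
msg = "Expected {common_length} but got {length} for tensor number {tensor_idx} in the list"
# After: an f-string interpolates the actual values
msg = f"Expected {common_length} but got {length} for tensor number {tensor_idx} in the list"
print(msg)  # Expected 4 but got 3 for tensor number 2 in the list
```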
Reviewed By: Yuzhen11
Differential Revision: D42491648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92124
Approved by: https://github.com/malfet
Before
```python
tmp0 = 2.0
tmp2 = tl.libdevice.pow(tmp0, tmp1)
```
After
```python
tmp1 = tl.libdevice.exp2(tmp0)
```
I've benchmarked on CPU and CUDA with the following examples
```python
@torch._dynamo.optimize()
def exp2(x):
return torch.pow(2, x)
@torch._dynamo.optimize()
def logaddexp2(a, b):
m = torch.maximum(a, b)
return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```
On CUDA, triton is able to specialize `pow(2, x)` such that this makes
no difference, but on CPU I see a surprisingly large speedup.
| device | Function | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA | exp2 | 64 | 63 | 1.0 |
| | logaddexp | 109 | 107 | 1.0 |
| CPU | exp2 | 220 | 40 | 5.5 |
| | logaddexp | 282 | 140 | 2.0 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
Previously, we only created the directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, we would run into "No such file or directory" errors.
This is the fix for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
Sample value from the test case `test_export_with_stack_trace`
node.target | node.meta["source_fn"]
-- | --
aten.randn.default | <built-in method randn of type object at 0x7f8683263108>
aten.t.default | < built-in function linear >
aten.mm.default | < built-in function linear >
aten.cos.default | <built-in method cos of type object at 0x7f8683263108>
aten.relu.default | relu
aten.add.Tensor | < built-in function add >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92399
Approved by: https://github.com/jerryzh168, https://github.com/yanboliang
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
This replaces `log2(1 + x)` with `log1p(x) * (1 / log(2))` which improves
precision when `x` is small by avoiding the truncation from calculating
`(1 + x) - 1`. Note that `x` is always `<= 1` in this formula.
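A quick numeric check of the precision claim (plain Python floats, purely illustrative):
```python
import math

x = 1e-17
# (1 + x) rounds to exactly 1.0 in double precision, so the small value is lost.
print(math.log2(1 + x))             # 0.0
# log1p never forms (1 + x), so the small value survives.
print(math.log1p(x) / math.log(2))  # ~1.4427e-17
```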
This also replaces `pow(2, x)` with `exp2(x)` which improves performance,
particularly on CPU where the constant value cannot be inlined into Sleef.
With numel=1e7 for example, I see a 1.35x speedup on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92116
Approved by: https://github.com/lezcano
This reverts commit 4386f317b92a400cabc6a25b5849466475eec1a9.
Reverted https://github.com/pytorch/pytorch/pull/92608 on behalf of https://github.com/ZainRizvi due to test_aot_autograd_symbolic_exhaustive_unsafe_split_cpu_float32 (__main__.TestEagerFusionOpInfoCPU) is failing consistently since this PR was merged
Another PR towards solving #89205.
What's in this PR:
* The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA
* The tests on forward call of `logcumsumexp` for complex numbers
* The implementation of backward `logcumsumexp` for complex numbers
What's missing:
* The test on backward gradient of `logcumsumexp` (it complains `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.` and I don't know how to solve the error or where to put the test for the backward computation). If possible, I'd like this to be done in this PR.
It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to put some comments explaining the reasonings of my decisions in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847
Approved by: https://github.com/albanD
For the micro-benchmarked op `aten.elu.default` in TIMM, the performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that inductor invokes the sleef `expm1` function while aten decomposes it with `exp - 1`, and the former performs worse than the latter. This PR decomposes `expm1` for cpp vectorization to bring the performance back.
Performance data for eager v.s. inductor:
suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This PR supports the following feature for QConfigMapping:
```
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.Conv2d, qconfig)
backend_config = get_qnnpack_pt2e_backend_config()
m = prepare_pt2e(m, qconfig_mapping, example_inputs, backend_config)
```
which means users want to set the qconfig for all calls to `torch.nn.Conv2d` to use `qconfig`. Note this is only verified for the case when the module is broken down to a single aten op right now, e.g. torch.nn.Conv2d becomes the torch.ops.aten.convolution op when traced through. We will need to support more complicated modules that are broken down into multiple operators later (e.g., MaxPool).
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qconfig_module_type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92355
Approved by: https://github.com/jcaip
This PR splits `test_fully_shard.py` into `fully_shard/test_fully_shard<...>.py`. This should help improve readability and avoid some future rebase conflicts.
The only other real change is resolving a `TODO` for using `run_subtests` in the model checkpointing unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92296
Approved by: https://github.com/mrshenli
Fixes #88098
### What Changed
* Moved `check_label.py` logic into `trymerge.py`
* Refactored relevant unittests
* ~~Dropped~~ Refactored `check_label.py` ci job
### Tests
`python .github/scripts/test_trymerge.py`
`python .github/scripts/test_check_labels.py`
`make lint & lintrunner -a`
### Notes to reviewers
This PR replaces the [original PR](https://github.com/pytorch/pytorch/pull/92225) to workaround the sticky EasyCLA failure mark on its first commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92309
Approved by: https://github.com/ZainRizvi
`TORCH_CHECK_TENSOR_ALL(cond, ...)` is a wrapper around `TORCH_CHECK` which allows the condition argument to be a tensor, batched or unbatched. `cond` can be a boolean tensor of any size. If any element is False, or if `cond.numel() == 0`, then `TORCH_CHECK_TENSOR_ALL` raises an error.
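A Python-level sketch of the described semantics (the real implementation is a C++ macro; the helper name here is hypothetical):
```python
import torch

def check_tensor_all(cond: torch.Tensor, msg: str) -> None:
    # Error if cond has no elements or if any element is False,
    # mirroring the TORCH_CHECK_TENSOR_ALL behavior described above.
    if cond.numel() == 0 or not bool(cond.all()):
        raise RuntimeError(msg)

check_tensor_all(torch.tensor([True, True]), "ok")       # passes
# check_tensor_all(torch.tensor([True, False]), "boom")  # would raise
```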
Part of #72948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89097
Approved by: https://github.com/zou3519
Fixing XLA test job flaky with sccache failing to start with a timeout error, for example:
* https://github.com/pytorch/pytorch/actions/runs/3953719143/jobs/6770489428
* https://github.com/pytorch/pytorch/actions/runs/3952860712/jobs/6769339620
* https://github.com/pytorch/pytorch/actions/runs/3946315315/jobs/6754126326
XLA test job actually builds XLA as part of the test ~~, so it needs sccache~~
* Register sccache epilogue before starting sccache, so that any errors when starting sccache can be printed
* Add `-e SKIP_SCCACHE_INITIALIZATION=1` to `_linux_test` workflow, this is the same flag used in `_linux_build` workflow. Quoted the reason from the build script:
> sccache --start-server seems to hang forever on self hosted runners for GHA so let's just go ahead and skip the --start-server altogether since it seems as though sccache still gets used even when the sscache server isn't started explicitly
* Also fix the code alignment in `.jenkins/pytorch/common-build.sh`
* We don't even use sccache in XLA test job, but there is an S3 cache used by bazel there (`XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92587
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
This reverts commit e525f433e15de1f16966901604a8c4c662828a8a.
Original PR: #85849
In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559
Approved by: https://github.com/albanD
The `tensor_properties` field of the `ShardedTensorMetadata` dataclass is a reference to a `TensorProperties` object. However, the field is set to `field(default=TensorProperties())` instead of `field(default_factory=TensorProperties)`. This causes an error when using Python 3.11 or later:
```python
ValueError: mutable default <class 'torch.distributed._shard.sharded_tensor.metadata.TensorProperties'> for field tensor_properties is not allowed: use default_factory
```
This change in dataclass behavior was introduced in [bpo-44674: Use unhashability as a proxy for mutability for default dataclass __init__ arguments](https://github.com/python/cpython/pull/29867).
The current use of `default` instead of `default_factory` also means that all `ShardedTensorMetadata` objects created without specifying `tensor_properties` will share the same `TensorProperties` object.
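A minimal sketch of the fix (only an illustrative subset of the real `TensorProperties` fields is shown):
```python
from dataclasses import dataclass, field

@dataclass
class TensorProperties:
    requires_grad: bool = False  # illustrative subset of the real fields

@dataclass
class ShardedTensorMetadata:
    # field(default=TensorProperties()) raises ValueError on Python 3.11+ and
    # shares one TensorProperties across all instances; default_factory
    # constructs a fresh object per ShardedTensorMetadata.
    tensor_properties: TensorProperties = field(default_factory=TensorProperties)
```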
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91795
Approved by: https://github.com/fduwjj
As per title.
Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
This PR is more of an RFC asking whether we intend to maintain parallelnative in the long term or to allow it to become community-supported.
If we want to maintain parallelnative, then let's close this PR.
If we do not, then we should remove it from trunk workflows into periodic (or just remove entirely).
Why shouldn't we just allow it to continue on CI regardless?
It adds friction to development! If we do support it, I think the friction is good--it prevents users from breaking what we support! But if not, then it is just another job users have to wait for before landing or another vector for flakiness to arise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92567
Approved by: https://github.com/malfet
I count the number of sub-graphs (for tiny-GPT2 in huggingface) by
```python
class GraphCaptureCompiler:
    def __init__(self):
        self.captured_graphs = []

    def compile(self, gm, example_inputs):
        self.captured_graphs.append(gm)
        return gm

compiler = GraphCaptureCompiler()
torch._dynamo.optimize(compiler, nopython=True)(Wrapper(fn))(*args)
```
Although `len(compiler.captured_graphs)` is 2, no error was thrown during compilation. This observation conflicts with `nopython=True`. After some digging, I found that a check was missing before making a graph break. This PR adds it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90970
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/thiagocrepaldi
Mitigates https://github.com/pytorch/pytorch/issues/91469
Changes:
- ~once_differentiable can now be parametrized to print a custom error message~
- instead of once_differentiable, we do the backward inside another custom Function, which makes sure the graph is connected, but also makes sure to error on double backward
- we now explicitly error when doing double backward with torch.compile + aot_autograd instead of being silently incorrect. ~The niceness of the error message can vary depending on whether your grad_outputs are passed, or whether you are doing `.grad()` or `.backward()`.~
Unchanged:
- doing backward inside compiled function is still allowed. It currently causes a graph break and is equivalent to doing backward outside the compiled function. It might be nice to disallow this explicitly as well, but that can be done in a follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92348
Approved by: https://github.com/albanD
* flatten the workflows into just jobs in order to give more specific links (link to the specific job that failed instead of just the pull workflow); this should make it easier to implement bypassing certain failures in the future
* catching MandatoryChecksMissingError from find_matching_merge_rule should fix the error where merge loops instead of raising a runtime error when a trunk job fails
* remove usage of on_green and mandatory_only flags just in case. on_green and force are the only two behaviors we currently use
* fail if ghstack pr has non ghstack change, tested locally with #92177 but unsure how to write tests b/c requires use of repo._run_git
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92097
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Summary:
A lot of other libraries have their own `xyz::Tensor` data structure. In some rare cases, when they interop with torch, there will be compilation errors such as
```
torch/csrc/api/include/torch/data/samplers/random.h(49): error: "Tensor" is ambiguous
```
Making the namespace of some `Tensor` usages explicit will resolve this.
Test Plan: CI
Differential Revision: D42538675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92266
Approved by: https://github.com/Skylion007, https://github.com/malfet
It turns out our old max/min implementation didn't do anything, because `__max__` and `__min__` are not actually magic methods in Python. So I give 'em the `sym_` treatment, similar to the other non-overrideable builtins.
NB: I would like to use `sym_max` when computing contiguous strides but this appears to make `python test/functorch/test_aotdispatch.py -v -k test_aot_autograd_symbolic_exhaustive_nn_functional_max_pool2d_cpu_float32` run extremely slowly. Needs investigating.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92107
Approved by: https://github.com/albanD, https://github.com/voznesenskym, https://github.com/Skylion007
After the previous fix to limit the CPU and memory used by Bazel, I see one case today where the runner runs out of memory in a "proper" way with exit code 137 0c8f4b5893. So, the memory usage must be close to limit of an 2xlarge instance. It makes sense to preemptively use 4xlarge now (like XLA)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92340
Approved by: https://github.com/clee2000
Changes in details:
- Fix and update some out-of-date type hints in `_functorch/make_functional.py`.
- ~Explicitly use `OrderedDict` for order-sensitive mappings.~
In `create_names_map()`, `_swap_state()`, and `FunctionalModuleWithBuffers.__init__()`, a plain unordered `dict` was used. The key order must be preserved for `dict.items()` because it is `zip`ped with a tuple of `params`/`buffers`. Although the built-in dictionary has been insertion ordered since Python 3.6 ([PEP 468](https://peps.python.org/pep-0468)), explicit is better than implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91579
Approved by: https://github.com/zou3519
This PR:
- Updates the docs to say it is deprecated
- Raises a UserWarning
- Changes most of the callsites inside PyTorch to use
torch.func.functional_call, minus the test_stateless testing.
The motivation behind this is that we can now align behind a single
functional_call API in PyTorch.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92280
Approved by: https://github.com/albanD
This PR:
- adds deprecation warnings when calling the functorch APIs
- adds documentation saying that those APIs are deprecated
It does this by creating thin wrappers around the original APIs that (1)
raise deprecation warnings and (2) have an additional line in their
documentation that they are deprecated.
NB:
- Python suppresses DeprecationWarning, so we use UserWarning instead.
Test Plan:
- New tests
- the functorch.* APIs are still tested for correctness because that's
what test/functorch/* use (as opposed to directly calling the
torch.func.* APIs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92279
Approved by: https://github.com/albanD, https://github.com/soulitzer
`torch.func.stack_module_state` is our replacement for
`functorch.combine_state_for_ensemble`. The most common usage for
combine_state_for_ensemble is to
- create stacked parameters and buffers
- use vmap to run the forward pass
- use regular PyTorch autograd to run the backward pass (e.g.,
Tensor.backward)
- optimize directly over the stacked parameters (this is more performant
than optimizing over the unstacked parameters).
Right now, stack_module_state returns stacked parameters that cannot be
optimized directly (only leaf tensors can have a .grad field); this PR
fixes that by turning the stacked parameters back into leaf tensors.
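A minimal sketch of the usage pattern described above (module sizes and names are illustrative):
```python
import torch
from torch import nn
from torch.func import stack_module_state, functional_call, vmap

models = [nn.Linear(4, 2) for _ in range(3)]
params, buffers = stack_module_state(models)  # dicts of stacked tensors

def run(p, b, x):
    # Reuse one module purely for its structure; its own params are swapped out.
    return functional_call(models[0], (p, b), (x,))

x = torch.randn(3, 4)                # one input row per ensemble member
out = vmap(run)(params, buffers, x)  # vmapped forward over the stacked params

# With this PR the stacked params are leaf tensors, so regular autograd can
# populate .grad on them and they can be handed directly to an optimizer.
out.sum().backward()
print(params["weight"].grad.shape)   # torch.Size([3, 2, 4])
```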
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92278
Approved by: https://github.com/soulitzer
`log1p` offers better precision near zero since `(1 + x) - 1` truncates any
values less than the float epsilon to zero. For `soft_margin_loss` this also
requires one fewer kernel invocation which for numel=1e7 gives me a 1.2x speedup
on CUDA and a 1.1x speedup on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92114
Approved by: https://github.com/ngimel, https://github.com/lezcano
> Reopen of https://github.com/pytorch/pytorch/pull/90354
**Summary**
The onednn quantization backend switches to the new API in `third_party/ideep`.
- `struct forward_params` for conv/deconv are changed. Modify primitive cache accordingly.
- Use new versions of `prepare` and `compute` API. Fp32 and int8 paths separated. The old ones will be deprecated.
- Now `ideep::tensor::reorder_if_differ_in` supports block-to-block reorder. Use it instead of defining a util function `onednn_utils::try_reorder`.
- For the new API of transposed convolution, we can use a flag to keep the weight desc aligned with oneDNN, so there is no need to transpose it explicitly in PyTorch.
- Use `is_channels_last` flag to specify layout of src/dst when querying expected weight desc.
It won't impact correctness. Performance should be unaffected or slightly better.
FBGEMM and QNNPACK backends are not affected.
Performance results are given below.
1. End-to-end performance of static quantized models (from torchvision)
(throughput: fps, higher is better)

2. Op benchmark of dynamic quantized linear
(Latency: ms, lower is better)

Test method & env:
- Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- Run multi-instances on a single node. Use one core for each instance.
- Use Jemalloc and Intel OpenMP
**Test plan**
python test/test_quantization.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91056
Approved by: https://github.com/jgong5
Currently the biasadd of MKL SGEMM is executed using an OpenMP macro; this leads to a performance issue when the SGEMM size is very small (e.g., M = 1, K = 80, N = 256) and many threads are used.
The reason is that in such a case `num_task < num_thread` and the per-task cost is very small (e.g., ~1-2 cycles for the memcpy), so the thread synchronization cost would be very large. Thus it is better to use `at::parallel_for` and run on the main thread directly.
Packed MKL SGEMM (1x80x256) | OpenMP biasadd | `at::parallel_for` biasadd
-- | -- | --
Latency | 2000 us | 21 us
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92300
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5
This PR adds FSDP and composable API files to `.lintrunner.toml` so that (1) lintrunner enforces that those files are formatted and (2) `lintrunner f` formats those files for you.
There are two requirements here (see https://github.com/pytorch/pytorch/wiki/lintrunner for details):
1. Install lintrunner:
```
pip install lintrunner
lintrunner init
```
2. `lintrunner f` before you finalize your PR, which would now be enforced by CI after this PR.
The code changes in this PR outside of `.lintrunner.toml` are the result of `lintrunner f`.
---
I only plan to land this PR if all of the composable API developers agree that this is something that makes sense and is not too intrusive to the workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90873
Approved by: https://github.com/yhcharles, https://github.com/mrshenli, https://github.com/rohan-varma
This PR:
- registers all of the codegened Nodes to the torch._C._functions module, this is where special nodes like AccumulateGrad are already registered.
- creates a autograd.graph.Node abstract base class that all of the newly registered nodes subclass from. We make the subclassing happen by implementing the ``__subclasshook__`` method
- enables static type checking to work and also enables Sphinx to generate documentation for the Node and its methods
- handles both the custom Function and codegened cases
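A quick sketch of what the registration enables (node names are whatever autograd happens to produce):
```python
import torch
from torch.autograd.graph import Node

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()

# Codegened autograd nodes now register as subclasses of the Node ABC, so
# isinstance checks and static typing against Node work for them.
print(isinstance(y.grad_fn, Node))  # True
print(y.grad_fn.name())             # e.g. "SumBackward0"
```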
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91475
Approved by: https://github.com/albanD
Summary: After D41587318 introduced new pytorch randomization, filament2 training failed due to the chunk size being 0. We gated the new change to external only to fix the filament2 package
Test Plan: f402461641 the flow has training successfully finished
Reviewed By: izaitsevfb
Differential Revision: D42501726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92190
Approved by: https://github.com/izaitsevfb
Fixes #52664. Checks if the attribute is a property that defines a setter and uses fset in __setattr__ rather than registering an inaccessible module / parameter.
This is BC-breaking as the attribute setters on nn.Module properties used to be ignored and now will be called properly.
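A minimal sketch of the behavior change (class and attribute names are illustrative):
```python
import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self._weight = nn.Parameter(torch.zeros(2))

    @property
    def weight(self):
        return self._weight

    @weight.setter
    def weight(self, value):
        # Normalize whatever is assigned so the module always stores a Parameter.
        self._weight = nn.Parameter(value.detach().clone().float())

m = M()
m.weight = nn.Parameter(torch.ones(2))
# Previously the assignment above registered a new (shadowed) parameter and the
# setter was ignored; with this change the property's fset runs instead.
print(m.weight)  # Parameter containing: tensor([1., 1.], requires_grad=True)
```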
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92044
Approved by: https://github.com/albanD
Cleaning FQN for `FullyShardedDataParallel(use_orig_params=True)` can cause some discrepancies with respect to the FQN compared to manually looping over `named_modules()` and `named_parameters()` together.
There is no requirement for the FQNs to be clean when using wrapper FSDP + `use_orig_params=True`. We can leave clean FQNs to `fully_shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91767
Approved by: https://github.com/zhaojuanmao
Addresses: https://github.com/pytorch/pytorch/issues/35802
Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#
### Changes in this PR
#### Implementation
- We now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks so that we can more precisely define their ordering and when they are executed.
- Since retains grad uses an entirely new field, we cannot reuse the old retains grad logic. We refactor retains grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handles cpp hooks must also be updated.
#### Hooks ordering and execution:
- Defines pre-hooks registered on tensor to run before pre-hooks registered on grad_fn
- Updates pre-hooks registered on tensor to always run, even if they are the inputs= to .grad()
- Post hooks (and pre hooks) can now observe the modifications to the gradient made by the tensor pre hooks (see the sketch below)
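A small sketch of the ordering described above (hook bodies are illustrative):
```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

# Tensor pre-hook: registered on the tensor, runs first and may modify the grad.
y.register_hook(lambda grad: grad * 10)

# grad_fn pre-hook: runs afterwards and observes the already-modified grad.
y.grad_fn.register_prehook(lambda grad_outputs: print("grad_fn pre-hook sees", grad_outputs))

y.sum().backward()
print(x.grad)  # reflects the *10 applied by the tensor pre-hook
```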
#### Retains grad hooks
- retains grad hooks always execute last, even if there are other tensor pre-hooks registered
#### Unchanged:
- pre_hooks registered to grad_fn aren't expected to execute if they are the inputs= to .grad()
Follow ups:
- simplify retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre hooks, this would involve some additional refactoring since
- the behavior of python hooks registered to a tensor under in-place operations is still wrong
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85849
Approved by: https://github.com/albanD
Ref #70924
This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
This PR removes the autograd.Function extension feature flag. This was
previously used for development of the functorch <> autograd.Function
interaction.
It's been in master for long enough with the feature flag defaulting to
True, so it's time to remove it.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92026
Approved by: https://github.com/soulitzer
functorch used to have a switch that enables/disables autograd.Function.
That switch now enables/disables torch.autograd.function._SingleLevelFunction, so
I've renamed it accordingly.
We could just delete the switch because users should not be directly
working with torch.autograd.function._SingleLevelFunction. However,
it was useful for debugging when something went wrong when I was
implementing the autograd.Function <> functorch interaction, so I want
to keep it around as a debugging tool for a while since the code is
already there.
Test Plan:
- updated tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92025
Approved by: https://github.com/soulitzer
We don't actually need `output_shapes` to implement
`generate_vmap_rule=True` support for autograd.Function.
- We need this in the vjp (backward) case because autograd automatically
reduces grad_inputs to inputs and we need to replicate that behavior.
In order to replicate that behavior, we recorded the original input
shapes so we know how to reduce the grad_input.
- There is no such behavior for forward-mode AD, so we don't need to
pass an `output_shapes` to reductify.
This PR simplifies the API of `reductify` and `reductify_leaf`. Instead
of accepting `input_shape_without_bdim` and `allow_expanded_grad`, we
now combine these into a single argument,
`reduce_to_input_shape_without_bdim`.
- if it is None, then we don't do anything
- if it is not-None and a shape, then we will reduce the grad to the
provided shape.
Test Plan:
- updated original unittests
- wait for test suite
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92024
Approved by: https://github.com/soulitzer
This PR:
- adds a nice error message if the user doesn't follow the API of the
vmap staticmethod correctly. That is, the user must return two
arguments from the vmap staticmethod API: (outputs, out_dims), and
out_dims must be a PyTree with either the same structure as `outputs`
or be broadcastable to the same structure as `outputs`.
- Fixes an edge case for out_dims=None. out_dims is allowed to be None,
but wrap_outputs_maintaining_identity was treating "None" as "This is
not the vmap case"
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92023
Approved by: https://github.com/soulitzer
This PR:
- changes generate_vmap_rule to either be True or False. Previously it
could be True, False, or not set. This simplifies the implementation a
bit.
- changes the vmap staticmethod to always be on the autograd.Function
rather than sometimes defined.
This is how the other staticmethod (forward, backward, jvp) are
implemented and allows us to document it.
There are 4 possible states for the autograd.Function w.r.t. to the
above:
- generate_vmap_rule is True, vmap staticmethod overriden. This raises
an error when used with vmap.
- generate_vmap_rule is False, vmap staticmethod overriden. This is
valid.
- generate_vmap_rule is True, vmap staticmethod not overriden. This is
valid.
- generate_vmap_rule is False, vmap staticmethod not overriden. This
raises an error when used with vmap.
Future:
- setup_context needs the same treatment, but that's a bit tricker to
implement.
Test Plan:
- new unittest
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91787
Approved by: https://github.com/soulitzer
Time comparison between using MultithreadedTestCase and MultiProcessTestCase on op db tests is amazing!
using MultiThreadTestCase on a AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py
============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======
real 1m22.330s
user 1m38.782s
sys 0m18.762s
```
MultiProcessTestCase spends from 40mins to more than 1h, even if using pytest parallel testing tools.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
This PR does a full rewrite of MultiThreadedTestCase to make it more
aligned with MultiProcessTestCase. It also changes how spawning
and testing are done, so that we can embed thread-local state when running
tests.
This PR enables device_type tests to work with MultiThreadedTestCase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
This PR refactors the threaded PG logic to enable multiple sub pg
creation under the world threaded pg, and allow the case where
we can call collectives together on different subpgs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91649
Approved by: https://github.com/XilunWu
Consider the following example:
```python
def fn(x):
y = torch.full_like(x, 1.2, dtype=torch.int64)
return x + y
```
In eager this truncates 1.2 to 1, then adds it to `x`. However, in
inductor the literal "1.2" is used verbatim and the result is off by
0.2. This fixes the issue by respecting the dtype argument to `ops.constant`
and truncating accordingly.
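For reference, a quick check of the eager behavior being matched:
```python
import torch

x = torch.zeros(3)
y = torch.full_like(x, 1.2, dtype=torch.int64)
print(y)      # tensor([1, 1, 1]) -- eager truncates 1.2 to 1
print(x + y)  # tensor([1., 1., 1.]), which inductor now matches
```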
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92093
Approved by: https://github.com/lezcano, https://github.com/jansel
Summary: This adds a new MTIA DeviceType which is associated with the MTIA DispatchKey and will be used for the Meta in-house training and inference accelerators.
Test Plan: All CI should pass.
Differential Revision: D42526044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92232
Approved by: https://github.com/ezyang
Summary: Fix an issue where the fp64 version of a model fails to run when convert_element_type
appears in the model. The failure can cause numerical differences to be
recognized as accuracy errors since the fp64 baseline result is not
available, and thus distracts the Minifier from finding the real culprit for
the accuracy error.
See the discussion in https://github.com/pytorch/torchdynamo/issues/1812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92036
Approved by: https://github.com/ngimel
Inspired by #92156, I realized our generated TensorBody.h has many methods that do unnecessary copies. Scalar is backed by a ptr and is therefore not trivially copyable, and care should be taken over ownership of the params. Since it's a template, clang-tidy was never run on it in a way that was able to propagate the changes back to the source code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92162
Approved by: https://github.com/ezyang
As titled. To register a custom op into Executorch, we need:
* `custom_ops.yaml`, defines the operator schema and the corresponding native function.
* `custom_ops.cpp`, defines the kernel.
* `RegisterDispatchKeyCustomOps.cpp`, a template to register operator into PyTorch.
Added a new test for custom ops. The custom op `custom::add_3.out` takes 3 tensors and add them together. The test makes sure it is registered correctly and then verifies the outcome is correct.
Differential Revision: [D42204263](https://our.internmc.facebook.com/intern/diff/D42204263/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91291
Approved by: https://github.com/ezyang
Old behavior would have adadelta foreach sending tensors to the slow path if they were not all the same dtype or not all on the same device.
This PR adds grouping for adadelta optimizer so that it would run foreach in batches, allowing more users to benefit from foreach perf.
Of course, we should ensure that the new implementation works, so there are new tests to ensure this behavior is not broken.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD
This clang-tidy check is disabled globally due to false positives on containers, but there are a few places here where applying it would actually improve performance (by allowing STL containers to use the move constructor / assignment)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92156
Approved by: https://github.com/ngimel
It's empty at the moment, but would tentatively include ROCm trunk jobs. This adopts the same practice we have for inductor where it's run for every commit on trunk, and on PR with `ciflow/unstable` label
- [x] Allow `ciflow/unstable` as a valid tag https://github.com/pytorch/test-infra/pull/1394
- [x] Create the unstable workflow on PyTorch https://github.com/pytorch/pytorch/pull/92106
- [ ] Gather reliability metrics of ROCm runner
- [ ] Decide if we want to move ROCMs trunk jobs to the unstable workflow
- [ ] Add redness metrics for the unstable workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92106
Approved by: https://github.com/ZainRizvi
This PR adds the `_profile_using_dynolog` function to `torch/__init__.py`. The `_profile_using_dynolog` method allows registering the optimizer step post hook. This is required to collect iteration based traces using dynolog.
Other related changes for tests to pass:
1. Updated `optimizer.pyi`
1. Updated `overrides.py`
1. The test `test_kineto_profiler_multiple_steppers` in `test_profiler.py` has been broken down into two cases:
- `test_kineto_profiler_multiple_steppers_with_override_True` : this test uses the override argument
- `test_kineto_profiler_multiple_steppers_with_override_False` : this test uses the environment variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90101
Approved by: https://github.com/albanD
--whole-archive is a linker option (notice that the flag is passed as -Wl,--whole-archive), and -force_load is indeed available on the macOS platform (below is the quote from man ld):
-force_load path_to_archive
Loads all members of the specified static archive library. Note:
-all_load forces all members of all archives to be loaded. This
option allows you to target a specific archive.
Quote from malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91736
Approved by: https://github.com/larryliu0820
As we live in C++17 world
This is a functional no-op, just
- `s/namespace at { namespace native {/namespace at::native {/`
- `s/namespace torch { namespace jit {/namespace torch::jit {/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92100
Approved by: https://github.com/izaitsevfb
Adds a PyInstDecoder object that handles the differences in bytecode
added in 3.11. Basically some instructions have inline caches which
change the size of the instruction, so calculating the next instruction
is slightly different.
fixes #91246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91290
Approved by: https://github.com/albanD
Fixes #91654.
Currently, the `hook` parameters of `nn.Module.register_forward_pre_hook` and `nn.Module.register_forward_hook` are typed as `Callable[..., None]`, which 1) does not enable the validation of the signature of `hook` and 2) incorrectly restricts the return type of `hook`, which the docstrings of these methods themselves state can be non-`None`.
The typing of the first parameter of `hook` as `TypeVar("T", bound="Module")` allows the binding of `Callable` whose first parameter is a subclass of `Module`.
---
Here are some examples of:
1. forward hooks and pre-hook hooks being accepted by mypy according to the new type hints
2. mypy throwing errors d.t. incorrect `hook` signatures
3. false negatives of pre-hooks being accepted as forward hooks
4. false negatives of hooks with kwargs being accepted irrespective of the value provided for `with_kwargs`
```python
from typing import Any, Dict, Tuple
import torch
from torch import nn
def forward_pre_hook(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
) -> None:
...
def forward_pre_hook_return_input(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
) -> Tuple[torch.Tensor, ...]:
...
def forward_pre_hook_with_kwargs(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
kwargs: Dict[str, Any],
) -> None:
...
def forward_pre_hook_with_kwargs_return_input(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
kwargs: Dict[str, Any],
) -> Tuple[Tuple[torch.Tensor, ...], Dict[str, Any]]:
...
def forward_hook(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
output: torch.Tensor,
) -> None:
...
def forward_hook_return_output(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
output: torch.Tensor,
) -> torch.Tensor:
...
def forward_hook_with_kwargs(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
kwargs: Dict[str, Any],
output: torch.Tensor,
) -> None:
...
def forward_hook_with_kwargs_return_output(
module: nn.Linear,
args: Tuple[torch.Tensor, ...],
kwargs: Dict[str, Any],
output: torch.Tensor,
) -> torch.Tensor:
...
model = nn.Module()
# OK
model.register_forward_pre_hook(forward_pre_hook)
model.register_forward_pre_hook(forward_pre_hook_return_input)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs, with_kwargs=True)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs_return_input, with_kwargs=True)
model.register_forward_hook(forward_hook)
model.register_forward_hook(forward_hook_return_output)
model.register_forward_hook(forward_hook_with_kwargs, with_kwargs=True)
model.register_forward_hook(forward_hook_with_kwargs_return_output, with_kwargs=True)
# mypy(error): [arg-type]
model.register_forward_pre_hook(forward_hook)
model.register_forward_pre_hook(forward_hook_return_output)
model.register_forward_pre_hook(forward_hook_with_kwargs)
model.register_forward_pre_hook(forward_hook_with_kwargs_return_output)
model.register_forward_hook(forward_pre_hook)
model.register_forward_hook(forward_pre_hook_return_input)
# false negatives
model.register_forward_hook(forward_pre_hook_with_kwargs)
model.register_forward_hook(forward_pre_hook_with_kwargs_return_input)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs, with_kwargs=False)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs_return_input, with_kwargs=False)
...
```
---
Though it is not functional as of mypy 0.991, the ideal typing of these methods would use [`typing.Literal`](https://mypy.readthedocs.io/en/stable/literal_types.html#literal-types):
```python
T = TypeVar("T", bound="Module")
class Module:
@overload
def register_forward_hook(
self,
hook: Callable[[T, Tuple[Any, ...], Any], Optional[Any]],
*,
prepend: bool = ...,
with_kwargs: Literal[False] = ...,
) -> RemovableHandle:
...
@overload
def register_forward_hook(
self,
hook: Callable[[T, Tuple[Any, ...], Dict[str, Any], Any], Optional[Any]],
*,
prepend: bool = ...,
with_kwargs: Literal[True] = ...,
) -> RemovableHandle:
...
def register_forward_hook(...):
...
```
which would:
1. validate the signature of `hook` according to the corresponding literal value provided for `with_kwargs` (and fix the false negative examples above)
2. implicitly define the [fallback `bool` signature](https://github.com/python/mypy/issues/6113#issuecomment-1266186192) e.g. to handle if a non-literal is provided for `with_kwargs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92061
Approved by: https://github.com/albanD
This PR is a copy of https://github.com/pytorch/pytorch/pull/90849, whose merge was reverted.
The PR adds a "check sparse tensor invariants" flag to Context that, when enabled, will trigger sparse tensor data invariant checks in the unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to the UI:
The `torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking.
The `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
The PR fixes https://github.com/pytorch/pytorch/issues/90833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92094
Approved by: https://github.com/cpuhrsch
The output of Torchbench model `doctr_det_predictor` on CPU is a `numpy ndarray`. When running the accuracy benchmark of this model, the below error is raised: `RuntimeError: unsupported type: ndarray`.
Repro CMD:
```bash
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcpu -n50 --inductor --no-skip --dashboard --only doctr_det_predictor --batch_size 1 --threads 1
```
This PR adds the support to compare `numpy ndarray` in the dynamo utils.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91870
Approved by: https://github.com/jgong5, https://github.com/Chillee
Very low probability, but it is possible to have all values positive throughout the
execution of this test model. The test tries to fake an incorrect export by replacing
relu's output with its input. However, the behavior of the model is the same when
values are all positive, hence leading to a false test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92105
Approved by: https://github.com/titaiwangms
Summary: This commit moves the API specification section of
the BackendConfig tutorial to the docstrings, which is a more
suitable place for this content. This change also reduces some
duplication. There is no new content added in this change.
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91999
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
### Changelist
* Change Windows TORCH_CUDA_ARCH_LIST from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU
* Correctly disable some tests that requires flash attention, which is not available on Windows at the moment. This has been fixed by https://github.com/pytorch/pytorch/pull/91979
* G5 runner has `AMD EPYC 7R32` CPU, not an Intel one
* This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated accordingly to use `GetDefaultCPUAllocator` correctly instead of `GetDefaultMobileCPUAllocator` for mobile build
* Also one periodic test `test_cpu_gpu_parity_nn_Conv3d_cuda_float32` fails with Tensor not close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
### Performance gain
* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - The very rough estimation of the duration is 1h30m for each shard, meaning a half an hour gain (**25%**)
### Pricing
On demand hourly rate:
* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36
So the current runner is not only more expensive but is also slower. Switching to G5 runners for Windows should cut down the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**
### Rolling out
https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure the capacity of the runner before PR can be merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
I'm seeing quite a number of runner errors "i-NUMBER lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error" with Bazel build and test job, i.e. https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=bazel
The job runs on a normal `linux.2xlarge` runner. As the error doesn't occur with any other jobs running on the same type of runner, with the exception of XLA, I suspect that this is due to a resource constraint crashing the runner. So this PR sets a limit on the amount of memory and CPU that bazel can use. Even if bazel crashes, i.e. with an OOM error, it's still better than crashing the whole runner and losing all the logs.
Example failures:
* 33e3c9ac67
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92056
Approved by: https://github.com/ZainRizvi
Description:
- output memory format is matching input for bicubic2d
Problem: output tensor's memory format does not match input format for bicubic2d
```python
import torch
i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"
> AssertionError: Should be channels last but given channels first (True)
```
Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )
Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev
- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
Fix https://github.com/pytorch/torchdynamo/issues/1915
This PR adds the vectorization support for transposed operations in TorchInductor CPP backend. It contains the following changes:
1. `CppTile2DKernelChecker` is added to check the eligibility of applying the optimization. We only address a narrow set of situations. All of the following conditions should be met: 1) There exists one and only one fp32 load/store with outer loop var having contiguous buffer accesses. 2) When a load/store doesn't have contiguous access in an outer loop var, the access should be vectorizable from the inner-most dim. 3) No reduction. More scenarios/operations would be supported in future PRs.
2. If `CppTile2DKernelChecker` reports the optimization is doable, `CppKernelProxy` would split/tile the loops from both the outer loop var having contiguous buffer access and the inner-most loop var.
3. The main loop split from the outer loop var is further split at the inner-most level and then handled by `CppTile2DKernel` and `CppTile2DTailKernel` which generate the transposed load/store. The former kernel does the vectorized transposed load/store on tiles and then does vectorized load/store/compute along the inner-most loop axis. The vectorized transpose micro-kernel implementation borrows/refers to that from FBGEMM. The latter kernel simply does scalar operations.
4. The tail loop split from the outer loop var directly calls `CppKernel` with scalar operations.
Next steps:
1. Support vectorized transpose with smaller tile size at one dim but bigger tile size at the other, e.g., 3x784.
2. Support reduction vectorized on the outer loop var (contiguous from outer loop var, not with inner-most loop var)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91532
Approved by: https://github.com/EikanWang, https://github.com/jansel
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:
```python
def set_device():
with torch.cuda.device(0):
pass
%timeit set_device()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
This makes some minor fixes to ensure that `use_orig_params=True`, `no_sync()`, and mixed precision work together for `FULL_SHARD`, `SHARD_GRAD_OP`, and `NO_SHARD`.
The added unit test only checks that dtypes are correct since for FP16, it is hard to test for numeric parity against a baseline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91193
Approved by: https://github.com/zhaojuanmao
Closes https://github.com/pytorch/pytorch/issues/90838.
To make mixed precision precise internally, https://github.com/pytorch/pytorch/pull/90660 changed the implementation to save `_orig_param_dtype`, `_low_prec_param_dtype`, and `_reduce_dtype` explicitly. However, these are computed at FSDP construction time, so it does not allow the user to change the model dtype after FSDP construction time but before lazy initialization. This PR recomputes those dtype attributes as needed if the model dtype changes in that window.
Note that any mixed precision settings specified by the user take precedence over the model dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91192
Approved by: https://github.com/zhaojuanmao
#75854
A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.
Still needs sanity checks, testing, and minimization of BC-breaking changes.
Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While doing this seems to work for `ivalue.h` and `ivalue_inl.h`, this doesn't seem to be naively working for the JIT CUDA stream wrapper (something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.
The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, then does the addmm and converts back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`).
### Benchmarks
Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:
* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128
So if we store the **adjacency matrix** as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, which will OOM with the current code. I extract the first 1k rows to compare and see a **1100x** speedup:
CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!
### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!
### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```
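For context, a minimal usage sketch of the op being optimized (shapes and values are illustrative):
```python
import torch

# `input` must be sparse CSR; mat1 @ mat2 is evaluated only at its nonzeros.
pattern = torch.tensor([[1., 0.], [0., 1.]]).to_sparse_csr()
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 2)
out = torch.sparse.sampled_addmm(pattern, mat1, mat2)
print(out)  # sparse CSR result with the same sparsity pattern as `pattern`
```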
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
In control_flow.cond(), we unwrap an argument's proxy using the
get_proxy_slot() call, which in the end calls a lambda to get the stored
proxy. For SymInt and SymFloat we hide the proxy under a thunk instead
of storing the proxy on the .proxy attribute directly, therefore we need to
special-case SymInt for unwrapping here.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91907
Approved by: https://github.com/ezyang
This PR fixes 2 bugs with CUDA `_foreach_norm`:
1. Wrong norm when tensors are larger than kChunkSize = 65536
```
>>> torch._foreach_norm([torch.ones(60000, device="cuda") for _ in range(1)])
(tensor(244.9490, device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch._foreach_norm([torch.ones(70000, device="cuda") for _ in range(1)])
(tensor(256., device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch.ones(60000, device="cuda").norm()
tensor(244.9490, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
>>> torch.ones(70000, device="cuda").norm()
tensor(264.5751, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
```
2. Error when a tensor numel is smaller than the number of tensors
```
>> torch._foreach_norm([torch.ones(9, device="cuda") for _ in range(10)])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: select(): index 9 out of range for tensor of size [9] at dimension 0
```
This bug could have been caught by tests if `PYTORCH_TEST_WITH_SLOW` was 1, because it would have tested tensors of size 300*300=90000. It's not enabled by default; does anyone know if it's ever enabled?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91844
Approved by: https://github.com/ngimel
Summary:
This is a fix for the following issue:
"When two nodes in a model have the same dTypes / node.target, the torch quantization prepare_fx flow does not check for duplicates and tries to do a custom module swap twice. When it attempts the swap the same target for a second time, the swap_custom_module_to_observed detects the observed module instead of the float module class on the target, and fails on an assertion. "
The added unit test demonstrates a simple example where it fails in absence of this fix.
Test Plan: buck test mode/dev //caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_custom_module_class_input_has_duplicate_nodes (quantization.fx.test_quantize_fx.TestQuantizeFx)'
Reviewed By: vkuzo
Differential Revision: D42023273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91905
Approved by: https://github.com/jerryzh168
Handle tensor default func/method args when inlining
Previously, when inlining a function, its default arguments
were only wrapped with VariableTrackers if non-tensor. Now,
tensor default args are also handled by adding them to the
parent InstructionTranslator as an attribute.
- also patches up a missing source in nnmodule call_function,
needed to properly guard on a default arg in its methods
- adds a new 'DefaultsSource' type which guards either a `__defaults__`
or `__kwdefaults__` entry on a function (see the illustration below)
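For illustration, a plain-Python sketch of where such defaults live (this is not dynamo internals, just the attributes `DefaultsSource` guards on):
```python
import torch

def f(x, bias=torch.ones(3), *, scale=torch.tensor(2.0)):
    return x * scale + bias

# Tensor defaults are stored on the function object itself:
print(f.__defaults__)    # (tensor([1., 1., 1.]),)
print(f.__kwdefaults__)  # {'scale': tensor(2.)}
```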
Fixes #90361, https://github.com/pytorch/torchdynamo/issues/1968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90575
Approved by: https://github.com/voznesenskym
**Motivation**
When adding support for default args (#90575), a lot of VariableTrackers missing sources were encountered. Currently, in a lot of cases it seems OK to skip the source for VariableTrackers created (especially during inlining), but that assumption breaks down when inlining functions with default arguments.
**Summary** of changes
- propagate the self.source of the VariableBuilder to the new variables being built, which seems like it was an omission previously
- Add SuperSource to track usages of super(), so that SuperVariables can support function calls with default args
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91729
Approved by: https://github.com/ezyang
3 fixes made to control_flow.map:
1. argument list won't accept torch.nn.Module anymore, only Tensors.
2. during tracing we call new_empty from the returned sample output
instead of xs to correctly inherit tensor metadata.
3. for FakeTensorMode we implement map() using new_empty() as well
instead of torch.stack() to preserve symbolic shape output.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91906
Approved by: https://github.com/tugsbayasgalan
In the classic PyG use case for message passing, `gather` has an `index` tensor in a broadcasted shape, e.g. with shape `[5000, 128]` and stride `[1, 0]`. That indicates the gather is done on whole rows of the self tensor. The current implementation tries to parallelize on the inner dimension, which performs badly on CPU and cannot be vectorized.
This PR addresses this use case and optimizes in a similar manner to index_select: parallelize on the outer dimension of `index` and do a vectorized copy on the inner dimension.
Performance benchmarking on a single-socket Xeon Ice Lake running `GCN`: `gather` is reduced from `150.787ms` to `10.926ms`. After this optimization, `gather` will no longer be the major bottleneck for training GNN models when `EdgeIndex` is in COO format.
For more details, please refer to https://github.com/pyg-team/pytorch_geometric/issues/4891#issuecomment-1288423705
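For illustration, a small self-contained sketch of this access pattern (sizes are made up; this is not the benchmark above):
```python
import torch

num_nodes, num_edges, feat = 10000, 5000, 128
src = torch.randn(num_nodes, feat)
row = torch.randint(0, num_nodes, (num_edges, 1))
index = row.expand(num_edges, feat)   # shape (5000, 128), stride (1, 0): broadcast along features
out = torch.gather(src, 0, index)     # gathers whole rows of src
# Equivalent to an index_select over rows, which is how this PR parallelizes it
assert torch.equal(out, src.index_select(0, row.squeeze(1)))
```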
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87586
Approved by: https://github.com/rusty1s, https://github.com/malfet
Basically the same as #88644, to fix warnings like `ptxas warning : Value of threads per SM for entry _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_10NormTwoffEEjfLi4EEEEEvT1_ is out of range. .minnctapersm will be ignored`
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91972
Approved by: https://github.com/ngimel
Fixes 14k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_Sanster_lama_cleaner.py#L2392
Error
```
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/guards.py", line 263, in CONSTANT_MATCH
self.EQUALS_MATCH(guard)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/guards.py", line 197, in EQUALS_MATCH
assert istype(
AssertionError: float64
```
```np.float``` is unspecialized by default, which is guarded with ```TYPE_MATCH```. However, it gets baked in when used in control flow, which is guarded with ```EQUALS_MATCH```. We should make ```EQUALS_MATCH``` support ```np.float```.
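A minimal repro sketch of the pattern (the function and values below are hypothetical, not taken from the 14k-models suite):
```python
import numpy as np
import torch
import torch._dynamo as dynamo

def fn(x, threshold):
    # threshold is an np.float64 scalar; once it drives control flow it gets
    # baked in, which requires an EQUALS_MATCH guard on its value.
    if threshold > 0.5:
        return x + 1
    return x - 1

compiled = dynamo.optimize("eager")(fn)
print(compiled(torch.randn(3), np.float64(0.7)))
```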
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91991
Approved by: https://github.com/jansel
Summary: * when we try to port a py obj of a script module/obj to c++, `tryToInferType` is flawed in providing type inference metadata, but changing it would break the normal torch.jit.script flow, so we instead try to extract the ivalue in the py obj value.
Test Plan: NA
Reviewed By: PaliC
Differential Revision: D41749823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91776
Approved by: https://github.com/842974287
This is implementing an idea from @lezcano : if we have a generated triton kernel with `xnumel=1`, then `xmask` is just `0<1` and can be dropped from all `load`/`store`/`where`.
The `xnumel=1` case actually comes up relatively often when code for reductions is being generated. @lezcano reported some performance gains in micro-benchmarks (see comment below) and it is a very simple change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91254
Approved by: https://github.com/jansel, https://github.com/ngimel
I'm at a loss to explain why this happens, but not setting the manifest file explicitly in the linker fixes it.
### Testing locally
* With `/MANIFESTFILE:bin\torch_python.dll.manifest`
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_python.dll.manifest
LINK : fatal error LNK1000: Internal error during CImplib::EmitImportThunk
```
* Work fine without the flag
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST
```
In both cases, the `/MANIFEST` flag is set, so the manifest file is there. In the latter case, the filename comes from appending the `.manifest` suffix to `bin\torch_python.dll`, so it is still correctly `bin\torch_python.dll.manifest`. Weird.
```
C:\actions-runner\_work\pytorch\pytorch>ls -la build/bin/torch_*
-rwxr-xr-x 1 runneruser 197121 246796288 Jan 11 04:30 build/bin/torch_cpu.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 04:26 build/bin/torch_cpu.dll.manifest
-rwxr-xr-x 1 runneruser 197121 9728 Jan 11 03:55 build/bin/torch_global_deps.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 03:55 build/bin/torch_global_deps.dll.manifest
-rwxr-xr-x 1 runneruser 197121 11746816 Jan 11 04:31 build/bin/torch_python.dll
-rw-r--r-- 1 runneruser 197121 381 Jan 11 04:30 build/bin/torch_python.dll.manifest
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91988
Approved by: https://github.com/malfet, https://github.com/Blackhex, https://github.com/ZainRizvi
I see https://github.com/pytorch/pytorch/issues/53103 says this might be problematic, but I'm a bit confused at this point, because it looks like ModuleList does in fact already adhere to the Sequence API
The big win here is that for homogeneous ModuleLists, you now get typing for individual members, e.g.
`ModuleList([Linear(), Linear(), Linear()])[1]` properly has type `Linear`
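For illustration, a small sketch of the typing win (runtime behavior is unchanged; the benefit is what a static type checker can infer for indexing):
```python
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)])
second = layers[1]          # with this PR's typing, a checker can see this as nn.Linear
print(second.in_features)   # attribute access without a cast
```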
If this looks good, I can do a followup PR to do similarly for `ModuleDict` and `Parameter[List,Dict]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89135
Approved by: https://github.com/albanD
In preparation for https://github.com/pytorch/pytorch/pull/89621.
The build changes in #89621 would require re-writing the internal build
in order to get NVFuser support. As-is, #89621 would disable NVFuser in
the internal build; so I would need to add some internal-only changes
associated with the internal copy of the PR (not visible from github) to
fix the internal build.
However, I don't think NVFuser is actually being used internally
anywhere at the moment, so it may be easier to land #89621 as is, and
then we can fix the internal build later if needed. To verify that, I
want to land this PR instead to flush out any issues caused by disabling
NVFuser. If the PR lands without issues, then we can move on to landing #89621.
If the PR breaks things internally, then it will need to be reverted;
and that will probably be easier than having to revert and reland #89621.
Differential Revision: [D42398050](https://our.internmc.facebook.com/intern/diff/D42398050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91836
Approved by: https://github.com/jjsjann123
fast_sigmoid uses fast_tanh under the hood, which is not precise;
the op outputs are treated as probability-like numbers;
in a really small percentage of cases the outputs fell out of the acceptable range for probabilities
Test Plan: ci
Differential Revision: D42445821
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91993
Approved by: https://github.com/davidberard98
Fixes#90652
Previously, we had assumed that the only way to call `handle_torch_function_no_python_arg_parser` was through the Python key. This is no longer true with FakeTensor. Specifically, `_like` functions will call `.device()` on FakeTensors when the args list is being parsed. In order to respect that the mode stack shouldn't run when the python key is off, this just adds a check that the python key is on (or the torch_function equivalent of that check) to this function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91573
Approved by: https://github.com/ezyang
The eager implementation of softmax supports computation along zero dimensions, but many of the other implementations did not, including:
* decompositions & refs (this was causing dynamo failures)
* forward AD for logsumexp
* MPS log_softmax_backward
This PR handles the `input.numel() == 0` cases separately to avoid running `amax()`, which fails for zero dimensions, and updates opinfos.
example of "computation along zero dimensions":
```python
# example of softmax along a dimension of size zero
import torch
t = torch.rand((4, 0, 0))
print("~")
print(torch.nn.functional.softmax(t, dim=-1)) # this passes
print("~")
torch._refs.softmax(t, dim=-1) # this fails
print("~")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91322
Approved by: https://github.com/lezcano
# Summary
Memory efficient attention is a non deterministic algorithm.
This PR ensures that sdp_choice will allow memory-efficient attention to be used as the backend for SDPA if we are in warn-only mode. Otherwise, if we have enabled determinism and set warn_only to False, sdp_choice will not return memory-efficient attention as the backend.
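A usage sketch of the user-facing knobs involved (assumes a CUDA build; the backend selection itself happens inside SDPA):
```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))

torch.use_deterministic_algorithms(True, warn_only=True)
out = F.scaled_dot_product_attention(q, k, v)   # mem-efficient backend may still be chosen

torch.use_deterministic_algorithms(True, warn_only=False)
out = F.scaled_dot_product_attention(q, k, v)   # mem-efficient backend will not be chosen
```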
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91979
Approved by: https://github.com/cpuhrsch
In the `peephole` pass, user nodes of the output of `prim::PackPadded` are modified to consume
the input of `prim::PackPadded` instead; hence the workaround logic in shape/type inference. However,
only the first output requires this workaround.
Fixes#91528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91829
Approved by: https://github.com/titaiwangms
### Motivation of this PR
This PR targets improving the performance of `scatter_add` for GNN usage scenarios on PyG. Currently only CPU optimization is covered.
`Message Passing` is the major step in GNN learning which means exchanging/aggregating info between nodes. And from the perf point of view, if the `EdgeIndex` is stored as [2, num_edges], `scatter_reduce` would be a major perf hotspot on current pytorch implementations.
To be more specific, in the process of message passing, `scatter_add` is used in a very similar way as `index_select`, except that the `self` tensor is written into while `index_select` is only reading. Therefore, the `index` tensor passed to `scatter_add` is an expanded tensor on dim0, which means all the rest of dims would end up with the same value.
### Algorithm
The current scatter impl parallelizes on the inner dims for such a case, which causes bad perf: a non-contiguous memory access pattern that cannot be vectorized.
This PR sorts the `index` to solve the write conflicts that would arise if we directly parallelize on dim0. The algorithm is equivalent to the following (a sketch follows the list):
* convert memory format from `COO` to `CSR`
* do spmm reduce
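A small Python sketch of the equivalent computation (illustrative sizes; this is not the actual ATen kernel):
```python
import torch

num_nodes, num_edges, feat = 5, 12, 4
src = torch.randn(num_edges, feat)                 # one message per edge
index = torch.randint(0, num_nodes, (num_edges,))  # destination node per edge

# Baseline: index expanded along the trailing dim, as in message passing
expanded = index.unsqueeze(1).expand(-1, feat)
out_ref = torch.zeros(num_nodes, feat).scatter_add_(0, expanded, src)

# "Sort + segment reduce": sort by destination (COO -> CSR), then reduce each row.
# Rows are independent, so the outer loop can be parallelized without write conflicts.
sorted_index, perm = index.sort()
sorted_src = src[perm]
counts = torch.bincount(sorted_index, minlength=num_nodes)
out = torch.zeros(num_nodes, feat)
start = 0
for row, cnt in enumerate(counts.tolist()):
    out[row] = sorted_src[start:start + cnt].sum(dim=0)
    start += cnt

assert torch.allclose(out, out_ref, atol=1e-5)
```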
### Perf improvement
The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples, `python reddit.py` which runs model SAGE on dataset reddit.
CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
`aten::scatter_add_` has been reduced from **37.797s** to **5.989s**:
* breakdown before
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::scatter_add_ 49.00% 37.797s 49.00% 37.797s 41.445ms 912
aten::index_select 19.74% 15.223s 19.74% 15.227s 6.678ms 2280
aten::linear 0.01% 5.706ms 15.04% 11.602s 12.721ms 912
aten::addmm 6.62% 5.108s 7.92% 6.112s 13.403ms 456
aten::matmul 0.00% 2.339ms 7.10% 5.475s 12.006ms 456
```
* breakdown after
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::index_select 32.41% 14.677s 32.42% 14.681s 6.439ms 2280
aten::linear 0.01% 6.665ms 26.43% 11.968s 13.123ms 912
aten::addmm 11.76% 5.328s 13.76% 6.232s 13.667ms 456
aten::scatter_add_ 13.22% 5.989s 13.22% 5.989s 6.566ms 912
aten::matmul 0.01% 2.303ms 12.63% 5.720s 12.543ms 456
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82703
Approved by: https://github.com/jgong5, https://github.com/ezyang
test-times.json uses the job name as the key, but when looking up the times in CI, the BUILD_ENVIRONMENT is used because we don't have a good way of getting the job name (it usually turns out to be just "test" or "build" instead of "linux-cuda..."), so having the job names match the BUILD_ENVIRONMENT is necessary for sharding to work.
Another solution might be to make the lookup more robust or look up the job name similar to how we get the job id.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91512
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
This reverts commit c55293d64099ac4380f5e3955a891d1d7924f327.
Reverted https://github.com/pytorch/pytorch/pull/90869 on behalf of https://github.com/huydhn due to Crossref error cannot just simply be ignored because it would break trunk for every commits after this, i.e. fd0030fe74. The failure would need to be handled gracefully, i.e. adding an XFAIL for example
This PR adds "check sparse tensor invariants" flag to Context that when enabled will trigger sparse tensor data invariants checks in unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to UI:
- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden (a usage sketch follows this list).
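A usage sketch of the per-call override (using the `check_invariants` argument described above):
```python
import torch

crow = torch.tensor([0, 2, 4])
col = torch.tensor([0, 1, 0, 1])
vals = torch.tensor([1., 2., 3., 4.])
# With invariant checking enabled, malformed indices raise at construction time
# instead of silently producing an invalid tensor.
t = torch.sparse_csr_tensor(crow, col, vals, size=(2, 2), check_invariants=True)
print(t.to_dense())
```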
The PR also fixes https://github.com/pytorch/pytorch/issues/90833
# Main issue
*The following content is outdated after merging the PRs in this ghstack but kept for the record.*
The importance of this feature is that when enabling the invariants checks by default, say, via
<details>
```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c8543057c7..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:
# Populate magic methods on SymInt and SymFloat
import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```
</details>
a massive number of test failures/errors occur in test_sparse_csr.py tests:
```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```
that means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:
```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"
RuntimeError: CUDA error: device-side assert triggered
RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.
RuntimeError: expected col_indices to be a strided and contiguous tensor
RuntimeError: expected row_indices to be a strided and contiguous tensor
RuntimeError: expected values to be a strided and contiguous tensor
RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered
RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90849
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
Many of the previous inductive cases were wrong (e.g. `abs`, `sq`, `div` and `truediv`).
We rewrite it using the mathematical terms that allow us to prove the relevant upper
and lower bounds.
Note that the inductive step can be seen as a not-too-difficult optimisation problem
with constraints, hence the naming of the functions.
For many of the other functions, we also simplify the formulas, which will be useful
when this code is generalised to work with symbolic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91601
Approved by: https://github.com/jansel, https://github.com/eellison
Fixes https://github.com/pytorch/pytorch/issues/86975
If the destination is a strided MPS tensor and the source is a CPU tensor, we cannot perform a blit directly to copy the memory from the CPU tensor into the MPS tensor. We need to scatter the data into the right indices.
```
a1 = torch.Tensor([[1,2],[3,4], [5,6]]).to(torch.device("mps"))
b1 = torch.Tensor([-1, -1])
a1[1:,1] = b1 # strided MPS destination / contiguous CPU source
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91784
Approved by: https://github.com/kulinseth
Solve contiguous view tensors using arrayViews directly instead of performing blit or gather.
E.g in case of the following example:
```
x = torch.tensor([1,2,3,4], device="mps")
y = x[2:]
r = y + 2
```
Previously, `y` would be materialized using a gather or a blit. With this change, the memory of `y` is aliased directly using arrayViews, thus skipping the need for blit or gather.
Fixes pytorch#85297, pytorch#86048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91743
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
@bypass-github-export-checks
Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```).
The configs are copied from Conv2d with the kernel parameter removed.
I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even though that would not be the case.
Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918
Approved by: https://github.com/mcr229
@bypass-github-export-checks
This diff allows for adding entries to the shader registry by specifying which op names and registry keys should map to a template codegen Shader in the codegen Shader's glslt and params yaml files.
This can be done by
- adding a REGISTER_FOR entry which maps to either a tuple of (op name, list of registry keys) or null to the YAML file, and
- adding a ```REGISTER_FOR = $REGISTER_FOR``` line to the ShaderInfo comment in the glslt file
Ex.
YAML File:
```
conv2d_pw:
parameter_names_with_default_values:
...
REGISTER_FOR:
- !!python/tuple ["conv2d_pw", ["catchall"]]
parameter_values:
- ...
REGISTER_FOR: null
```
GLSLT File:
```
...
* REGISTER_FOR = $REGISTER_FOR
...
```
This diff also registers the conv2d_pw_2x2 Shader under ```'conv2d_pw' → 'catchall'``` in the registry and uses ```VK_REGISTRY_KERNEL``` to retrieve the shader by look-up in the registry
The shader registry generated in spv.cpp now looks like
```
ShaderRegistry shader_registry = {
{"conv2d", {{"catchall", "conv2d"}}},
{"conv2d_pw", {{"catchall", "conv2d_pw_2x2"}}}};
```
and the generated conv2d_pw_KxK.glsl files look like:
K=1
```
...
/*
* TILE_SIZE = (1, 1, 1)
* WEIGHT_STORAGE = TEXTURE_2D
* WEIGHT_STORAGE_LAYOUT = OC4,IC4,4ic,4oc
* BIAS_STORAGE = TEXTURE_2D
* REGISTER_FOR = None
*/
...
```
K=2
```
...
/*
* TILE_SIZE = (2, 2, 1)
* WEIGHT_STORAGE = TEXTURE_2D
* WEIGHT_STORAGE_LAYOUT = OC4,IC4,4ic,4oc
* BIAS_STORAGE = TEXTURE_2D
* REGISTER_FOR = ('conv2d_pw', ['catchall'])
*/
...
```
Differential Revision: [D42198560](https://our.internmc.facebook.com/intern/diff/D42198560/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91916
Approved by: https://github.com/mcr229
@bypass-github-export-checks
This diff allows for adding entries to the shader registry by specifying which op names and registry keys should map to a Shader in the Shader's glsl file.
This can be done by adding a REGISTER_FOR line with a tuple of (op name, list of registry keys) to the ShaderInfo comment in the glsl file
Ex.
```
REGISTER_FOR = ('conv2d', ['catchall', ...])
```
This diff also registers the conv2d Shader under ```'conv2d' → 'catchall'``` in the registry and uses ```VK_REGISTRY_KERNEL``` to retrieve the shader by look-up in the registry
The shader registry generated in spv.cpp now looks like
```
ShaderRegistry shader_registry = {
{"conv2d", {{"catchall", "conv2d"}}}};
```
Differential Revision: [D42197400](https://our.internmc.facebook.com/intern/diff/D42197400/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91915
Approved by: https://github.com/mcr229
@bypass-github-export-checks
We want to be able to look-up which shader to use in a registry given a particular op/algorithm name, which is what this diff enables. This is done with the newly added ```shader_registry``` map and ```look_up_shader_info``` function.
After this change, Shaders can be retrieved either with the ```VK_KERNEL``` macro, which gets the Shader with a specified name directly, or with the ```VK_REGISTRY_KERNEL``` macro, which looks up what Shader should be used for a specified algorithm name in the registry.
For now, the registry is empty and unused. In the next diffs in this stack, I will be adding support for registering a shader in the registry in GLSL and GLSLT + Params Yaml files.
I also
- Adjusted the formatting of spv.h and spv.cpp so that they are closer to what clang wants, which makes them easier to read. (proper indentation, proper order of includes, etc.)
- Moved the codegen spv/registry code from at::native::vulkan to at::native::vulkan::api (since registry.cpp / .h are in ```ATen/native/vulkan/api```)
Now spv.h looks like
```
#pragma once
#include <ATen/native/vulkan/api/Types.h>
#include <ATen/native/vulkan/api/vk_api.h>
#include <c10/util/flat_hash_map.h>
#include <string>
namespace at {
namespace native {
namespace vulkan {
namespace api {
struct ShaderInfo;
} // namespace api
typedef ska::flat_hash_map<std::string, api::ShaderInfo> ShaderListing;
typedef ska::flat_hash_map<std::string, std::string> RegistryKeyMap;
typedef ska::flat_hash_map<std::string, RegistryKeyMap> ShaderRegistry;
extern const ShaderListing shader_infos;
extern ShaderRegistry shader_registry;
inline const ShaderListing& get_shader_infos() {
return shader_infos;
}
inline ShaderRegistry& get_shader_registry() {
return shader_registry;
}
} // namespace vulkan
} // namespace native
} // namespace at
```
and spv.cpp looks like
```
#include <ATen/native/vulkan/api/Shader.h>
#include <ATen/native/vulkan/spv.h>
#include <stdint.h>
#include <vector>
namespace at {
namespace native {
namespace vulkan {
namespace {
const uint32_t adaptive_avg_pool2d_bin[] = {
119734787,
...
};
...
const uint32_t conv2d_pw_2x2_bin[] = {
119734787,
...
};
} // namespace
const ShaderListing shader_infos = {
{"adaptive_avg_pool2d",
api::ShaderInfo(
"vulkan.adaptive_avg_pool2d",
adaptive_avg_pool2d_bin,
3204,
{VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
std::vector<uint32_t>(),
api::StorageType::UNKNOWN,
api::StorageType::UNKNOWN)},
...
{"conv2d_pw_2x2",
api::ShaderInfo(
"vulkan.conv2d_pw_2x2",
conv2d_pw_2x2_bin,
7736,
{VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
{2, 2, 1},
api::StorageType::TEXTURE_2D,
api::StorageType::TEXTURE_2D)}};
ShaderRegistry shader_registry = {
};
} // namespace vulkan
} // namespace native
} // namespace at
```
(Full File: P594112814)
Differential Revision: [D41594453](https://our.internmc.facebook.com/intern/diff/D41594453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91914
Approved by: https://github.com/mcr229
@bypass-github-export-checks
To include custom locations when building with buck, use a ```-c gen_vulkan_spv.additional_glsl_paths="..."``` flag where ... is a list of filegroups and source directory paths separated by spaces,
ex. to include the sources added in D41413913, you would use
```
buck build ... -c gen_vulkan_spv.additional_glsl_paths="//xplat/caffe2:test_glsl_src_path_a test_src/a //xplat/caffe2:test_glsl_src_path_b test_src/b"
```
(as shown in the test plan)
Differential Revision: [D41413914](https://our.internmc.facebook.com/intern/diff/D41413914/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41413914/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91913
Approved by: https://github.com/mcr229
Summary:
This PR adds in SaliencyPruner, an implementation of L1 norm pruning for structured pruning, as well as additional tests for the SaliencyPruner
The README.md references this file but I forgot to add it in earlier when writing the tutorial.
Test Plan:
```
python test/test_ao_sparsity.py -- TestSaliencyPruner
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91814
Approved by: https://github.com/jerryzh168
Patches the missing pin_memory argument on full::meta_impl. This is not a functional break, but it does cause a test failure, which asserts that no warning is raised:
`python test/test_nvfuser_dynamo.py -k test_batch_norm_implicit_dtype_promotion`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91541
Approved by: https://github.com/malfet
I'm not sure why this was left out in the first place as all adjacent operations have both Half and BFloat16. Things seem to work as expected and this enables `relu6` to be used in bfloat16 training. Hardtanh backward is super simple and precision is not relevant.
```
import torch
x_fp32 = torch.tensor([-1,2,4,7], requires_grad=True, dtype=torch.float32, device="cuda")
x_bf16 = torch.tensor([-1,2,4,7], requires_grad=True, dtype=torch.bfloat16, device="cuda")
torch.nn.functional.relu6(x_fp32).sum().backward()
torch.nn.functional.relu6(x_bf16).sum().backward()
assert (x_fp32.grad == x_bf16.grad).all()
```
Previously would fail with:
```
Traceback (most recent call last):
File "test_hardtanh_patch.py", line 5, in <module>
torch.nn.functional.relu6(x_bf16).sum().backward()
File ".../lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File ".../lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: "hardtanh_backward_cuda" not implemented for 'BFloat16'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91511
Approved by: https://github.com/ngimel
This reverts commit 9945a78a94bd9907c05b102984c7233faa44ad14.
Reverted https://github.com/pytorch/pytorch/pull/90463 on behalf of https://github.com/ZainRizvi due to This is causing test failures: FAILED inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_pinv_singular_cuda_float64 - RuntimeError: unexpected success linalg.pinv.singular, torch.float64, cuda
Currently, most of the reduction ops flatten the input tensor to 1D to perform the operation.
This change removes the flattening of the tensors / the unranked placeholders and adds support for multiple axes in all the reduction ops (a small usage sketch follows the list of fixed ops below).
- Fixes reduction ops with correctness and shape issues.
- Fixes masked.argmax / masked.argmin. In case of passing inf to argmax / argmin, MPS will return nan as index for these numbers. Casting this nan to Long will make it -1. This change avoids negative values by clamping them to 0 (matching CPU results).
TestConsistency issues fixed:
```
std
var
amax
amin
sum
prod
mean
count_nonzero
masked.amax
masked.amin
masked.mean
masked.prod
masked.std
masked.sum
```
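A small usage sketch of the multi-axis support (assumes a machine where the `"mps"` device is available):
```python
import torch

x = torch.randn(2, 3, 4, device="mps")
print(x.sum(dim=(0, 2)))     # reduce over several axes without flattening to 1D
print(x.amax(dim=(1, 2)))
print(x.prod(dim=0))
```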
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91734
Approved by: https://github.com/kulinseth
The pattern can't be matched if one attribute is `_param_constant1` and the other is `_param_constant0`
Large graph:
```
# call_function addmm_default aten.addmm.default (_param_constant1, ph_0, _tensor_constant0) {}
```
Pattern graph
```
# call_function addmm_default aten.addmm.default (_param_constant0, ph_0, _tensor_constant0) {}
```
Differential Revision: [D42316574](https://our.internmc.facebook.com/intern/diff/D42316574/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91657
Approved by: https://github.com/SherlockNoMad
We are using the idiom
```py
sys.path.insert(0, path)
# do something
sys.path.remove(path)
```
three times in `torch.hub`. This is a textbook case for using a context manager. In addition, by using `try` / `finally` we can ensure that the Python path is back in its original state even if the actual action raises an exception (a sketch of such a helper follows the comparison below):
```py
import sys
path = "/tmp"
# PR
try:
sys.path.insert(0, path)
try:
# Any exception raised while performing the actual functionality
raise Exception
finally:
sys.path.remove(path)
except Exception:
assert path not in sys.path
# main
try:
sys.path.insert(0, path)
# Any exception raised while performing the actual functionality
raise Exception
sys.path.remove(path)
except Exception:
assert path in sys.path
```
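A minimal sketch of such a helper (the name is hypothetical; the PR's actual implementation in `torch.hub` may differ):
```python
import sys
from contextlib import contextmanager

@contextmanager
def _add_to_sys_path(path):
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

# sys.path is restored even if the body raises
with _add_to_sys_path("/tmp/example_hub_dir"):
    assert "/tmp/example_hub_dir" in sys.path
assert "/tmp/example_hub_dir" not in sys.path
```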
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75786
Approved by: https://github.com/NicolasHug
We should not allow creating a derived source (e.g. AttrSource) without a valid base source.
It's more reliable to check this in the source `__init__` or `__post_init__` than asserting we have a valid source before passing that to an AttrSource() call.
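An illustrative sketch of that construction-time check (these are not the actual dynamo Source classes):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Source:
    pass

@dataclass(frozen=True)
class AttrSource(Source):
    base: Optional[Source]
    member: str

    def __post_init__(self):
        # Fail fast at construction time instead of at every call site.
        assert self.base is not None, "Can't construct an AttrSource without a valid base source"

AttrSource(Source(), "weight")    # ok
# AttrSource(None, "weight")      # would fail immediately
```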
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91711
Approved by: https://github.com/voznesenskym
This is needed to support `enum.Enum` derived classes in Python-3.11
that adds `_new_member_` to classdict, see:
15c44789bb/Lib/enum.py (L529)
The following snippet illustrates the problem with the previous iteration of
the code on 3.11:
```python
from enum import Enum
import inspect
class Color(Enum):
RED = 1
GREEN = 2
def print_routines(cls):
print(cls.__name__)
for name in cls.__dict__:
fn = getattr(cls, name)
if inspect.isroutine(fn):
print(name, fn, f"has_globals: {hasattr(fn, '__globals__')}")
print_routines(Color)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91805
Approved by: https://github.com/albanD, https://github.com/suo
Summary: We don't care about params/buffers being mutated in dynamo export, so it is safe to always convert them to faketensor
Test Plan: CI
Differential Revision: D42353789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91742
Approved by: https://github.com/qihqi
# Summary
This PR updates the second return value from SDPA to return an empty tensor of size 0, rather than what it would be if need_attn_weights were True. It also updates the meta function to account for this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91782
Approved by: https://github.com/cpuhrsch
- Fixed the memory leak with the `malloc()`
- Introduced shortened data type strings (optional) to avoid getting extra long cached graph string keys with ops such as cat_out()
- Fixed data type issues in Monterey
- Removed the unused `use_scalar_value` argument from `getTensorsStringKey()`
- Clean up and refactoring
Fixes#89353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91786
Approved by: https://github.com/kulinseth
This PR refactors the loop related data structure to support the loop split at a given depth. Before this PR, the loop split is always supported at the inner-most level. With this PR, it is possible to support tiling at outer levels and at more than one levels. The `LoopNest` data structure is extended to support loop splits at various levels and renamed to `LoopNestWithSplit`. The `codegen_loops` function is also rewritten to be general to support arbitrary kernels set at the leaves of the loop structure.
This PR also improves the handling of reduction loops with split. The main loop and tail loop now work on their own reduction variables in parallel, without the data dependency they previously had. With this, two workarounds can be removed in the `CppVecKernel`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91397
Approved by: https://github.com/EikanWang, https://github.com/jansel
`setup.py clean` now won't remove paths matching .gitignore patterns across the entire filesystem. Instead, only files from the repository will be removed.
`/build_*` had to be removed from .gitignore because with the wildcard fixed, build_variables.bzl file was deleted on cleanup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91503
Approved by: https://github.com/soumith
#84624 introduces an update on `torch.norm` [dispatch logic](eaa43d9f25/torch/functional.py (L1489)) which now depends on `layout`, resulting in regressions when exporting related operators from TorchScript.
This PR resolves the regression by partially supporting a subset use case of the `prim::layout` (only `torch.strided`) and `aten::__contains__` (only constants) operators. Properly supporting other layouts, e.g. `torch.sparse_coo`, would require much more effort: extending JIT types and supporting the related family of ops like `aten::to_sparse`. This is out of the scope of this PR.
Fixes#83661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91660
Approved by: https://github.com/justinchuby, https://github.com/kit1980
- Workaround for MaxPool when ceilMode=true
- Workaround for ChannelsLast memory format
- Workaround for divisor_override in AvgPool ops
- Enabled count_include_pad parameter for AvgPool
- Refactoring and clean up of duplicate code
- Enable MaxPool tests in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91519
Approved by: https://github.com/kulinseth, https://github.com/malfet
**Summary**:
When converting a ref module into a quant module, the `_lower_static_weighted_ref_module` pass assumes the `ref_node` only has 1 input node, and only removes the first `dequant` node. We add a check in this PR to ensure this is the case for the `_lower_static_weighted_ref_module` pass.
**Test Plan**:
We only add a check in this PR, there is no new added test case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90157
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5, https://github.com/jerryzh168
We've already shown some promising perf results by integrating dynamo with torchxla for inference. To provide a consistent UX for training and for inference, in this PR we try to enable training for dynamo/torchxla.
Training is trickier than inference and we may not expect much perf gain since:
1. in the training case, torchxla only generates a single combined graph for fwd/bwd/optimizer, while in the `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works, we will generate 3 graphs: one for the forward, one for the backward and one for the optimizer. XLA favors larger graphs to do more optimizations.
2. in the training case, tracing overhead can be overlapped with computation. Tracing overhead is not as big a deal for training as for inference. After all, training cares more about throughput while inference cares more about latency.
3. in the training case, people can increase the batch size to 'mitigate' the tracing overhead. Increasing the batch size does not change the tracing overhead, so it appears as if the tracing overhead 'per example' is reduced.
But we still want to add training support to dynamo/torchxla to make the work complete.
We added an '--iterations-per-run' argument to control how many iterations we do per measure/device sync. This is to understand the impact of item 2 above.
Results:
With '--iterations-per-run' equals to 1, here are the perf numbers:
```
+-------------------------+--------------------+-------------------------+
| Model | XLA (trace once) | XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18 | 0.91 | 0.959 |
+-------------------------+--------------------+-------------------------+
| resnet50 | 0.917 | 0.932 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d | 0.912 | 0.905 |
+-------------------------+--------------------+-------------------------+
| alexnet | 1.038 | 0.974 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2 | 0.881 | 0.835 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0 | 0.903 | 0.931 |
+-------------------------+--------------------+-------------------------+
| vgg16 | 0.914 | 0.967 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch | 1.359 | 0.84 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 1.288 | 0.893 |
+-------------------------+--------------------+-------------------------+
| geomean | 1.0006 | 0.913794 |
+-------------------------+--------------------+-------------------------+
```
Overall it looks like graph breaks indeed cause perf loss. But for BERT_pytorch and timm_vision_transformer we still see perf gains. We need to do more experiments with larger '--iterations-per-run'.
NOTE:
In torchbench.py I added the following code to apply a few workarounds:
```
from myscripts import workaround # TODO will remove this line before landing
```
Here is the content of workaround.py:
```
import torch
from torch import nn
import os
# override max_pool2d with avg_pool2d
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
torch.nn.MaxPool2d = torch.nn.AvgPool2d
```
It works around a few issues we found:
1. MaxPool2d does not work for training in dynamo/torchxla: https://github.com/pytorch/torchdynamo/issues/1837. WIP fix from Brian in https://github.com/pytorch/pytorch/pull/90226, https://github.com/pytorch/xla/pull/4276/files (WIP)
2. a recent change (this PR https://github.com/pytorch/pytorch/pull/88697) in op decomposition causes batch_norm ops to fall back in torchxla. Fix from Jack in https://github.com/pytorch/xla/pull/4282#event-7969608134 (confirmed the fix after adding a Deduper to handle duplicated returns from the fx graph generated by AOTAutograd)
3. we have an issue handling dropout because of a random-seed out-of-sync issue. Here is the fix: https://github.com/pytorch/xla/pull/4293 (confirmed the fix)
Example command:
```
REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88449
Approved by: https://github.com/wconstab, https://github.com/qihqi, https://github.com/malfet
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
Small refactor to remove any code used by vTensor under the `op/` folder to appropriate locations in the `api/` folder. Also remove vTensor from the `ops` namespace, it now resides in the higher level `at::native::vulkan` namespace which will also be used for the Graph data structures in the future.
This is the last step required for vTensor to be able to moved to the api folder.
Differential Revision: [D42052680](https://our.internmc.facebook.com/intern/diff/D42052680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91023
Approved by: https://github.com/salilsdesai
This is needed for MLIR rewrite
This replaces
```
xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK, 1])
```
with
```
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
```
so the code is a bit more readable, and compiles with master triton (which doesn't currently support the first construct).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91722
Approved by: https://github.com/desertfire
This diff removes all dependencies on ATen from the vTensor class, in preparation for moving the class to the `api/` folder so that it can be part of the core library (i.e. part of the `torch_vulkan_api` target introduced in the diff below, which should have no dependencies on ATen).
Most notably, the constructor of `vTensor` is changed to
```
vTensor(
api::Context* context,
IntArrayRef sizes,
const c10::ScalarType dtype = c10::kFloat,
const api::StorageType storage_type = api::StorageType::TEXTURE_3D,
const c10::MemoryFormat memory_format = c10::MemoryFormat::Contiguous);
```
Instead of accepting a `TensorOptions` argument, since `TensorOptions` is a part of ATen. The majority of changes in this diff are due to updating vTensor construction to use the new constructor.
Differential Revision: [D42049862](https://our.internmc.facebook.com/intern/diff/D42049862/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91022
Approved by: https://github.com/kimishpatel
- Implemented the following new ops: `upsample_nearest1d_backward`, `upsample_nearest_exact1d`, `upsample_nearest_exact1d_backward`
- Moved Upsample code from Shape.mm to Upsample.mm
- Fallback to CPU for nearest mode on Monterey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91669
Approved by: https://github.com/malfet
This diff isolates the core components of the Pytorch Vulkan backend into its own target (`//xplat/caffe2:torch_vulkan_api`). The main motivation for this is to create a library that does not have a dependency on the ATen library which can then be used to build a graph mode runtime for Vulkan for Executorch.
In addition to introducing the new target, this diff also removes some references to external dependencies in the `api/` folder so that files in that folder are completely self contained.
Differential Revision: [D42038817](https://our.internmc.facebook.com/intern/diff/D42038817/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42038817/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91021
Approved by: https://github.com/kirklandsign
The code produced by the lowering and the decomposition is now the same
modulo a casting to `float32`. This casting is necessary as otherwise
the tests do not pass due to accuracy errors. We prefer accuracy over
speed here, given that this is an associative scan, and thus it's prone
to numerical errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91621
Approved by: https://github.com/ngimel
Fixes#91404
As expected
```python
import torch
from functorch import vmap
x = torch.randn(32, 3, 3, 3)
y = vmap(torch.trace)(x)
print(y)
```
Now gives the exact same runtime error as eager mode
```
(sourcetorch) ubuntu@ip-172-31-39-26:~/test$ python functorch_test_pos.py
Traceback (most recent call last):
File "functorch_test_pos.py", line 4, in <module>
y = vmap(torch.trace)(x)
File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 420, in wrapped
return _flat_vmap(
File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 39, in fn
return f(*args, **kwargs)
File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 605, in _flat_vmap
batched_outputs = func(*batched_inputs, **kwargs)
RuntimeError: trace: expected a matrix, but got tensor with dim 3
```
Equivalent eager code
```python
import torch
x = torch.randn(32, 3, 3, 3)
results = []
for xi in x:
y = torch.trace(xi)
results.append(y)
```
```
(sourcetorch) ubuntu@ip-172-31-39-26:~/test$ python functorch_test_neg.py
Traceback (most recent call last):
File "functorch_test_neg.py", line 5, in <module>
y = torch.trace(xi)
RuntimeError: trace: expected a matrix, but got tensor with dim 3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91679
Approved by: https://github.com/zou3519
Continuation of #79979.
Fixes#79161
This PR does the following:
* Expands the `parametrize_fn()` signature from returning a 3-tuple of `(test, test_name, param_kwargs)` to returning a 4-tuple of `(test, test_name, param_kwargs, decorator_fn)`. Expected signature for the addition is `decorator_fn(param_kwargs) -> List[decorator]` i.e. given the full set of test params, return a list of decorators to apply.
* `modules`, `ops`, and `parametrize` now fit the new signature, returning `decorator_fn`s instead of applying decorators themselves.
* `instantiate_parametrized_tests()` and `instantiate_device_type_tests()` now call the returned `decorator_fn`, passing in the full set of `param_kwargs` (after composition + `device` / `dtype` additions) and applying the returned decorators.
* Composing multiple `parametrize_fn`s also composes the corresponding `decorator_fn`s; the composed `decorator_fn` simply concatenates the decorator lists returned by the constituents.
* Expands `DecorateInfo.is_active` to support callables:
```python
DecorateInfo(
unittest.expectedFailure, "TestOps", "test_python_ref_executor",
device_type='cuda', active_if=lambda params: params['executor'] == 'nvfuser'
),
```
* Adds several tests to `test/test_testing.py` ensuring proper decoration using `@parametrize`, `@modules`, and `@ops`.
* (minor) Fixes a couple `ModuleInfo` naming oddities uncovered during testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91658
Approved by: https://github.com/malfet
When we run the node with a fake value for tensor.item, it would previously error because the utility method doesn't know how to handle placeholder nodes. The tensor we are calling item on can be a user input, which will be a placeholder in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91668
Approved by: https://github.com/voznesenskym
Fixes#90940. This PR revamps how tests are run in parallel as well as device visibility at the docker container and within the run_test.py test runner.
First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners manifesting as timeouts. ROCm runners have at least 1 GPU each, but often 2 or more. This PR allows NUM_PROCS to be set equal to the number of devices available, but also takes care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.
Second, we had introduced env vars `-e ROCR_VISIBLE_DEVICES` (#91031) to prepare for two GHA runners per CI node, to split up the GPU visibility at the docker level between the two runners. This effort wasn't fully realized; to date, we haven't had more than one runner per CI host. We abandon this effort in favor of all GPUs being visible to a single runner and managing GPU resources as stated above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony
Ref #70924
This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
addmm_cuda_lt failed for some corner cases. So far we cannot reproduce the corner cases in unit tests; it seems that the failures do not depend only on the matrices' shapes and strides. For now, add an environment variable to allow users to disable this kernel for such corner cases.
**See the case one with more error logs:**
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80 result_stride 80 1 self_shape 80 self_stride 1 mat1_shape 1024 160 mat1_stride 160 1 mat2_shape 160 80 mat2_stride 1 160
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):
**another case with more error logs:**
RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16 result_stride 16 1 self_shape 16 self_stride 1 mat1_shape 16384 48 mat1_stride 48 1 mat2_shape 48 16 mat2_stride 1 48
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91436
Approved by: https://github.com/ngimel
This PR:
- Updates autograd.Function.forward docs to reflect how you either
define a forward with ctx or a separate forward and setup_context
- Updates the "Extending Autograd" docs to suggest the usage of
autograd.Function with separate forward and setup_context. This should
be the default because there is a low barrier to go from this to
an autograd.Function that is fully supported by functorch transforms.
- Adds a new "Extending torch.func with autograd.Function" doc that
explains how to use autograd.Function with torch.func. It also
explains how to use generate_vmap_rule and how to manually write a
vmap staticmethod.
While writing this, I noticed that the implementation of
setup_context staticmethod/generate_vmap_rule/vmap staticmethod is a
bit inconsistent with the other methods/attributes on autograd.Function:
- https://github.com/pytorch/pytorch/issues/91451
- I'm happy to fix those if we think it is a problem, either in this PR
or a followup (this PR is getting long, I want some initial docs
out that I can point early adopters at, and fixing the problems in the
future isn't really BC-breaking).
Test Plan:
- view docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91452
Approved by: https://github.com/soulitzer
Summary:
This diff is reverting D42051833
D42051833 has been identified to be causing the following test or build failures:
Tests affected:
- [//xplat/pytorch_models/build/MultitaskPeopleSegmentation/v7020:MultitaskPeopleSegmentation7020_testAndroid-64bit - runAllTests (com.facebook.xplat.XplatTestRunner)](https://www.internalfb.com/intern/test/281475056077477/)
- [//xplat/pytorch_models/build/MultitaskPeopleSegmentation/v4020:PYTORCH_MODEL_testAndroid-64bit - runAllTests (com.facebook.xplat.XplatTestRunner)](https://www.internalfb.com/intern/test/844425007913475/)
Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1478566
Here are the tasks that are relevant to this breakage:
T93205881: 15 tests started failing for oncall ai_infra_mobile_platform in the last 2 weeks
We're generating a revert to back out the changes in this diff; please note that the backout may land if someone accepts it.
Test Plan: NA
Differential Revision: D42090396
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91458
Approved by: https://github.com/kit1980
Fixes https://github.com/pytorch/functorch/issues/1087
It looks like there are `306` rules that should be looked into
```
test/functorch/test_vmap_registrations.py .x.....xxxxxxx.x.x.x.x.x.x.x.x........xx.x.x..x.x.xxx...xxxx.x.x.x........x.........xxxxx..x..x.....xx...xx.....xxx.xxxxxxxxxxxxxxxxx.. [ 24%]
.........x.x......x.xxxxxx..x..xx.x.xxx.x.......x.xxx.xx..xxx.xxx...xxxxx.x....xxxxxxxxxxxxxxx....xx.xxx.xx.x...xx...xx...xxxxxx...xxxxx..x...xxxxxxxxxxxx..xx..xx.xx.x..xxxx..xx [ 56%]
.xx..x.x....xxxxxx.x.xx...xxxxx.xx...x..x.x.xx...xx.xxxxxx.xxxxxx..x........xxxxxxxx..xxxxxxxx..xx.xxxxxxxxxxxxxxxxxxxxxxx..........xxxx.xxxx.........xxxxxxxx..xxx..xxx.x.x.x.xx [ 88%]
xx.xxx.x......xxx.x.xxxxxxxx....x......xxxxxxxxx.xx.x.x.x.......xx [100%]
=================================================================== 249 passed, 1185 deselected, 306 xfailed in 3.17s ===================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91367
Approved by: https://github.com/zou3519
Summary:
This diff is reverting D42257039
D42257039 has been identified to be causing the following test or build failures:
Tests affected:
- [assistant/neural_dm/rl/modules/tests:action_mask_classifier_test - main](https://www.internalfb.com/intern/test/281475048940766/)
Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1493969
Here are the tasks that are relevant to this breakage:
T93770103: 1 test started failing for oncall assistant_multimodal in the last 2 weeks
We're generating a revert to back out the changes in this diff; please note that the backout may land if someone accepts it.
Test Plan: NA
Reviewed By: weiwangmeta
Differential Revision: D42272391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91548
Approved by: https://github.com/kit1980
These functions will be legacy functions. We deprecate them, but we also
take this chance to dispatch to a more efficient and consistent implementation.
Doing so should help with writing a conversion rule for these, so that we can
remove them once and for all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81763
Approved by: https://github.com/ngimel
Fixes copies into slices where the input data type is different than the output dtype.
This change removes the cast done before scatter, so we don't have to allocate additional memory to perform the casting. Scatter handles the casting directly now.
device = "mps"
shape = (4, 4)
tensor = torch.randint(10, shape, device=device)
tensor_before = tensor.clone()
res = torch.empty(shape[0], shape[1] * 2, device=device)[:, ::2].copy_(tensor)
torch.testing.assert_close(tensor, tensor_before)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91197
Approved by: https://github.com/razarmehr
Apply clang-tidy fixups to prefer member initializers and modernize-pass-by-value. This is mostly a noop, but it should make a few ctors slightly more readable and more efficient. Also drops in some missing moves that prevent a lot of unnecessary copying.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91538
Approved by: https://github.com/ezyang
Follow-up of #86167; the number of pools was mistakenly ignored, and the default workspace size appears to be too small to match selected cuBLAS kernels before the explicit allocation change.
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
Small optimization for the hot path when thrashing the cache with dynamic shapes; in most cases we don't need the fallback generator so we can omit it unless needed later.
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90811
Approved by: https://github.com/ngimel
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/protocolbuffers/protobuf/releases">protobuf's releases</a>.</em></p>
<blockquote>
<h2>Protocol Buffers v3.20.2</h2>
<h1>C++</h1>
<ul>
<li>Reduce memory consumption of MessageSet parsing</li>
<li>This release addresses a <a href="https://github.com/protocolbuffers/protobuf/security/advisories/GHSA-8gq9-2x98-w8hf">Security Advisory for C++ and Python users</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="a20c65f2cd"><code>a20c65f</code></a> Updating changelog</li>
<li><a href="c49fe79af9"><code>c49fe79</code></a> Updating version.json and repo version numbers to: 20.2</li>
<li><a href="806d7e4ce6"><code>806d7e4</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10544">#10544</a> from deannagarcia/3.20.x</li>
<li><a href="ae718b3902"><code>ae718b3</code></a> Add missing includes</li>
<li><a href="b4c395aaed"><code>b4c395a</code></a> Apply patch</li>
<li><a href="6439c5c013"><code>6439c5c</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10531">#10531</a> from protocolbuffers/deannagarcia-patch-7</li>
<li><a href="22c79e6e4c"><code>22c79e6</code></a> Update version.json</li>
<li><a href="c1a2d2ec29"><code>c1a2d2e</code></a> Fix python release on macos (<a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10512">#10512</a>)</li>
<li><a href="a826282e15"><code>a826282</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/protocolbuffers/protobuf/issues/10505">#10505</a> from deannagarcia/3.20.x</li>
<li><a href="7639a710e1"><code>7639a71</code></a> Add version file</li>
<li>Additional commits viewable in <a href="https://github.com/protocolbuffers/protobuf/compare/v3.20.1...v3.20.2">compare view</a></li>
</ul>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91540
Approved by: https://github.com/huydhn
The main changes are:
1. Remove outdated checks for old compiler versions because they can't support C++17.
2. Remove outdated CMake checks because it now requires 3.18.
3. Remove outdated CUDA checks because we are moving to CUDA 11.
Almost all changes are in CMake files for easy auditing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90599
Approved by: https://github.com/soumith
**What does this PR do?**
This PR refactors `_optim_utils.py` to use `_FSDPState` instead of the `FullyShardedDataParallel` class. This change enables support of the optim state_dict for `fully_shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91234
Approved by: https://github.com/rohan-varma
Whenever you guard on something, you're supposed to tell GuardBuilder about it, so GuardBuilder knows that it has to actually bind it in scope when it creates the guard function. But shape env guards bypass that mechanism completely. Well, now they don't.
For the most part, this didn't matter in practice, because we usually had a `TENSOR_MATCH` guard floating around that made sure that the guard stayed live. But if we ever eliminate those guards (e.g., because we build it into the shape guard directly; something we'll probably want to do when https://github.com/pytorch/pytorch/pull/89707 goes online) then this will indeed matter.
One complication: some of the shape env guards are on globals. You have to make sure to shunt the usage to the correct guard builder in that case. Maybe it would be better if we refactored things so there is only one GuardBuilder. Not sure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91058
Approved by: https://github.com/voznesenskym
I'm going to need this in the follow up PR. Instead of storing only Source.name() in Symbol, I now store a full on Source. Lots of replumbing reoccurs. In particular:
- Move Source to torch._guards to break cycles
- I have to add TensorPropertySource and NegateSource to handle x.size()[0] and -x codegen that I was doing with string manipulation previously
- I tighten up invariants so that I never pass source=None; instead I pass ConstantSource (these are constant sources right) and test for that rather than source being missing. I think this is more parsimonious
- Some mypy wobbles from new imports
I didn't move LocalSource and friends to torch._guards, but I ended up needing to access them in a few places. The main annoyance with moving these is that then I also need to move the bytecode codegen stuff, and that's not so easy to move without bringing in the kitchen sink.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91057
Approved by: https://github.com/albanD, https://github.com/voznesenskym, https://github.com/zou3519
Summary: There was a patch to not raise SOFT_ASSERT in debug builds. Update this test to match it.
Test Plan: This test passes after this patch.
Differential Revision: D42270123
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91464
Approved by: https://github.com/robieta
Docs copy-pasted from functorch docs with minor adjustments. We are
keeping the functorch docs for BC, though that's up for debate -- we
could also just say "see .. in torch.func" for some, but not all doc
pages (we still want to keep around any examples that use
make_functional so that users can tell what the difference between that
and the new functional_call is).
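For context, a minimal sketch of the `functional_call` usage these docs describe (illustrative only; the model and shapes here are made up):
```python
import torch
from torch.func import functional_call

model = torch.nn.Linear(3, 2)
params = dict(model.named_parameters())
x = torch.randn(4, 3)
# Run the module with an explicit parameter dict rather than its own state.
out = functional_call(model, params, (x,))
```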
Test Plan:
- docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91319
Approved by: https://github.com/samdow
Applies some more fixes to headers that may have been missed before for performance optimization. cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @ezyang since this is more in the series of the clang-tidy fixups
This PR fixes 3 main issues:
1. Use emplacement more in headers
2. Avoid unnecessary copies and use const ref when possible
3. Default any special functions when possible to make them potentially trivial and more readable.
There is also one change in this PR that tries to prevent unnecessary math promotion; the rest of those changes are in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91445
Approved by: https://github.com/ezyang
This applies some more clang-tidy fixups. Particularly, this applies the modernize loops and modernize-use-transparent-functors checks. Transparent functors are less error prone since you don't have to worry about accidentally specifying the wrong type, and they are newly available as of C++17.
Modern foreach loops tend to be more readable and can be more efficient to iterate over since the loop condition is removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91449
Approved by: https://github.com/ezyang
Setting a timeout value when testing multiprocess DataLoader to prevent ASAN jobs timing out after 4 hours.
We are seeing multiple timeout issue running ASAN tests on HUD https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=asan for examples
* Without mem leak check enabled https://github.com/pytorch/pytorch/actions/runs/3794216079/jobs/6455118197
* With mem leak check https://github.com/pytorch/pytorch/actions/runs/3792743994/jobs/6449356306
Looking a bit closer into the test, the hanging happens when multiprocess DataLoader is used in `test_utils`. Here is the snapshot of those processes when I log into the hang runner:
```
UID PID PPID C STIME TTY TIME CMD
jenkins 1 0 0 Dec28 pts/0 00:00:00 bash
jenkins 8 0 0 Dec28 pts/1 00:00:00 sh -c pip install dist/torch-2.0.0a0+git97db9fd-cp37-cp37m-linux_x86_64.whl[opt-einsum] && .jenkins/pytorch/test.sh
jenkins 20 8 0 Dec28 pts/1 00:00:00 /bin/bash .jenkins/pytorch/test.sh
jenkins 764 20 0 Dec28 pts/1 00:00:07 python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose
jenkins 788 764 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6)
jenkins 3743 764 0 Dec28 pts/1 00:00:05 /opt/conda/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
jenkins 3766 3743 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3878 3766 0 Dec28 pts/1 00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3879 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3880 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3881 3766 0 Dec28 pts/1 00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins 3893 0 0 01:45 pts/2 00:00:00 /bin/bash
jenkins 3904 3893 0 01:46 pts/2 00:00:00 ps -ef
```
The specific hanging test was `test_random_seed` which spawned 4 subprocesses to load data. After I killed one of them, the test could continue and printed the following stacktrace:
```
test_random_seed (__main__.TestDataLoaderUtils) ... [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
ERROR (9345.840s)
test_random_seed (__main__.TestDataLoaderUtils) ... test_random_seed errored - num_retries_left: 3
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.7/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3878) is killed by signal: Terminated.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test_utils.py", line 469, in test_random_seed
x2 = run()
File "test_utils.py", line 464, in run
return next(iter(dataloader))
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 3878) exited unexpectedly
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
[W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
ok (0.137s)
```
This doesn't fix the underlying issue, which I'll need to follow up on to see why the workers hang. However, this should allow the test to terminate gracefully and report errors.
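One way to bound a hanging multiprocess DataLoader is its `timeout` argument, which raises instead of blocking forever (a sketch only; the PR may apply the timeout at a different level):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3))
# With a positive timeout (in seconds), a stuck worker raises a RuntimeError
# instead of hanging the whole test job.
loader = DataLoader(dataset, batch_size=2, num_workers=2, timeout=60)
for (batch,) in loader:
    pass
```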
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91476
Approved by: https://github.com/kit1980
This should help address https://github.com/pytorch/pytorch/issues/67002. At the end of these tests, any temp files `/dev/shm/torch_*` are cleaned up, but somehow it might take longer than 0.5s to finish, causing the test to fail. So, the PR increases this max waiting time to 5s while polling for the result every 0.5s as before.
### Testing
`pytest test_multiprocessing.py -k test_fs --verbose --flake-finder` to run `test_fs`, `test_fs_is_shared`, `test_fs_pool`, `test_fs_preserve_sharing`, and `test_fs_sharing` 50 times on a dynamo shard. All passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91459
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/atalman
The autograd.Function <> functorch interaction is in a mostly completed
state now. There are some minor action items remaining
(https://github.com/pytorch/pytorch/issues/90224), but I want to enable
the feature by default so that PyTorch CI / other parties / etc can
begin testing to see if there is any impact on the original
autograd.Function API (there shouldn't be).
The longer-term plan for the feature flag is:
- keep it around until at least the next release (so that people can
turn off the feature if it breaks something in existing code)
- delete the flag then (either before or after the release, I haven't
decided yet)
Test Plan:
- new test
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91441
Approved by: https://github.com/albanD, https://github.com/soulitzer
Rebasing certain merged PRs results in the rebased branch pointing at the target branch, because git believes the PR has already been included in the branch. Git does not replay the changes onto the target branch because the change is already in the target branch.
This usually affects PRs with only 1 commit (more commits -> trymerge squashes them when merged -> git believes that the change is not in the target branch because the squashed commit is different from the individual changes).
It might also affect ghstack changes, because behind the scenes the ghstack PRs are all contained within one commit on the orig branch, but I'm not sure about this.
Helps with https://github.com/pytorch/test-infra/issues/836
Looks like https://github.com/clee2000/random-testing/pull/44#issuecomment-1363439534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91337
Approved by: https://github.com/ZainRizvi
This PR moves the definitions for:
* `sym_int`
* `sym_ceil` (used only for `sym_int`)
* `sym_floor` (used only for `sym_int`)
* `sym_float`
from `torch/fx/experimental/symbolic_shapes.py` to `torch/__init__.py`, where `SymInt` and `SymFloat` are already defined.
This removes the need for several in-line imports, and enables proper JIT script gating for #91318. I'm very open to doing this in a better way!
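For context, a quick usage sketch of the relocated entry points (illustrative only):
```python
import torch

# Behave like int()/float() on plain numbers, but also accept SymInt/SymFloat,
# which is why they now live next to those types in torch/__init__.py.
i = torch.sym_int(3.7)
f = torch.sym_float(2)
print(i, f)
```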
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91317
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Fixes#91107
Added `softmax` docs in
- `pytorch/torch/_tensor_docs.py`
- `pytorch/torch/_torch_docs.py `
- `pytorch/docs/XXX.rst` files. Here XXX represents all those files where I made the change
Although I have added `softmax` to the `docs` directory, I was not sure which files/folders required the edits, so there could be issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91292
Approved by: https://github.com/lezcano
NumPy versions 1.22 and 1.23 (and their respective bugfix releases included) have a buggy implementation of the Dlpack deleter that doesn't account for no-GIL contexts. Since we now release the GIL when deallocating tensors in `THPVariable_clear`, this leads to a failure of internal consistency checks when freeing a Dlpack-backed tensor from NumPy.
This PR adds a check for the buggy NumPy versions and overrides the `DlManagedTensor` deleter to reacquire the GIL before deallocation.
### Rationale for this implementation
The version check was added to `tensor_numpy.h/cpp` as it seemed like a more logical location for it than creating a new translation unit. The overriding of the deleter was originally attempted by directly modifying `at::fromDlpack`, but the lack of a build dependency on the Python C API in A10 prevented that. So, I extended the A10 Dlpack API instead to additionally accept a custom deleter functor.
Fixes#88082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89759
Approved by: https://github.com/albanD
This function is an auxiliary function for `torch.norm`. This particular
overload was not even used or tested. I hope it's not used internally
either. If it is, we can simply drop this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81762
Approved by: https://github.com/ngimel
### Motivation
When dim is -1 and the slice of source or result is noncontiguous, the original `index_add` is slow, as it uses add for the sliced tensor, which is serial on the index and parallel on the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal, as the sliced tensor may not be big enough to parallelize, and it also incurs multiple parallelizations.
`scatter_add` is used to speed up this case, as `scatter_add` parallelizes on the outer dimension of the input and is serial on the inner dimension to avoid write conflicts. `scatter_add` only needs one parallel region, and the size of the outer dimensions is bigger, making the parallelization worthwhile.
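To illustrate the affected case, here is a sketch using the shapes from the benchmark below (not the benchmark script itself):
```python
import torch

x = torch.zeros(10, 128, 20, 20)
src = torch.randn(10, 128, 20, 20)
index = torch.randint(0, 20, (20,))
# dim=-1 with noncontiguous slices is the case now routed through scatter_add.
x.index_add_(-1, index, src)
```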
### Testing
- Single core:
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003
- Single socket (28 cores):
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
Noticed that the toSymFloat / toSymInt overloads always copied the internal pointer of an ivalue even if it was an rvalue, unlike other overloads (like toTensor). This fixes that issue by adding the appropriate methods needed to facilitate that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91405
Approved by: https://github.com/ezyang
Initialise the members boolean_ and integer_ of at::indexing::TensorIndex to false and 0 respectively, because the compiler-generated copy-ctor accesses them, which is UB. This resolves a compile-time warning, a runtime error from UBSan + gcc, and a runtime error from MSVC when compiling debug.
Fixes#90951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91399
Approved by: https://github.com/bdhirsh
Support for jvp is very similar to support for backward():
- We need to vmap over a version of the original autograd.Function's jvp
method that does not take ctx as input.
- On the output, we need to reductify to ensure the output tangent has
the same shape as the output. This reductify does not have the
extra reduction semantics, because PyTorch forward-mode AD requires the
output tangent to have the same exact shape as the output.
- setup_context needs to tell us the bdims of the saved_tensors
(necessary for vmap over jvp_no_context), as well
as the output shapes (necessary for reductify).
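As a toy illustration of the kind of transform stack this enables (not taken from this PR's tests):
```python
import torch

x = torch.randn(3)
t = torch.ones(3)
# Forward-mode AD through a vmapped function; custom autograd.Functions used
# inside the callable need the machinery added here to support this.
out, tangent = torch.func.jvp(torch.vmap(torch.sin), (x,), (t,))
```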
Test Plan:
- Added jvp support to the *GenVmapAutogradFunction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91211
Approved by: https://github.com/soulitzer
This PR adds the following tests. They will be useful as test cases for
generate_vmap_rule=True and jvp (to come soon)
- test_jvpvmap
- test_jvpvmapvmap
- test_vmapjvpvmap
- test_jvpjvpvmap
- test_jvpvjpvmap
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91206
Approved by: https://github.com/soulitzer
This reverts commit 57dcd93c4103c6db043f341a0242596a42188081.
Reverted https://github.com/pytorch/pytorch/pull/91371 on behalf of https://github.com/kit1980 due to trunk / cuda11.6-py3.10-gcc7-sm86 / test (slow, 1, 2, linux.g5.4xlarge.nvidia.gpu) started to fail after this PR with mypy error
Summary:
As title.
Saw this while working on another diff.
`storage` won't be defined in the `else` case. But this causes pyre to freak out.
Test Plan: Unit tests.
Differential Revision: D41751229
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90306
Approved by: https://github.com/PaliC
Fixes https://github.com/pytorch/pytorch/issues/91041
There's a bug in our boxed reduction batching rules for a very specific
case: vmap over a Tensor of shape [1] for an operation where the
output rank is supposed to be less than the input rank, e.g.
```
x = torch.tensor([10.], device=device)
y = vmap(lambda x: x.sum(0))(x)
```
The boxed reduction batching rule handles three types of "reduction"
operations:
- reduction operations with an optional keepdim argument, which
specifies if the output should have the same or smaller rank than the
input
- reduction operations without a keepdim arg that morally have keepdim=True (like cumsum --
which never actually modifies the rank of the tensor but is still a
"reduction" since it sums a bunch of things together)
- reduction operations without a keepdim arg that morally have
keepdim=False. (just torch.count_nonzero).
Furthermore, PyTorch has special handling for scalar tensors (e.g.
tensors of shape []). It is valid to do
`torch.sum(torch.tensor(10.), dim=0)`.
This PR updates the `boxed_reduction_batch_rule` to handle the
interaction between the three kinds of reduction and the scalar tensor
cases correctly. Concretely, it:
- introduces additional templates to `boxed_reduction_batch_rule` for
what type of "keepdim" reduction this is.
- splits the old REDUCTION_BOXED macro (which was a good default) into
REDUCTION_NO_KEEPDIM_ARG and REDUCTION_WITH_KEEPDIM_ARG (which are also
opinionated defaults) and uses them.
Test Plan:
- Given an input of shape [], our vmap OpInfo test suite only produces
a Tensor of shape [B] with B = 2. At first glance this doesn't look
sufficient to test this case (vmap over Tensor[1]), but the claim is
that it is because the boxed_reduction_batch_rule is agnostic to the shape
of the dimension being vmapped over. Previously it was not due to
the semantics of `squeeze`; this PR adds internal asserts to make it agnostic.
- there is a light test for vmap over the Tensor of shape [1] for
torch.sum as a sanity check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91109
Approved by: https://github.com/samdow
Fixes a broken header filter from #90699 and applies a few more clang-tidy fixes that are relevant to c10 and c10d. The header filter pattern was actually broken, and the clang-tidy include pattern was redundant. Also fixed a few bugs in torch/distributed/c10d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
This `is_forward_ad` isn't propagated, which leads to this line creating a
slow-gradcheck failure on master:
```
if not is_forward_ad and any(o.is_complex() for o in outputs):
raise ValueError("Expected output to be non-complex. get_numerical_jacobian no "
"longer supports functions that return complex outputs.")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91391
Approved by: https://github.com/albanD
## Problem
When models have a lot of complex repeated layers, the `print(module)` output becomes unfeasible to work with. For example, the current output of `__repr__` for `t5-small` is `715` lines long.
## Solution
Using the better `__repr__` it becomes `135`. For `t5-large`, the current `__repr__` prints `1411` lines; the better `__repr__` prints `135`, the same number as for t5-small, because most of the layers are just repeated. For `EleutherAI/gpt-j-6B` the number of lines reduces from `483` to just `24`.
Here's how it works: when ModuleList items have exactly the same `__repr__`, instead of printing each of them, it prints `N x {repr(item)}`. The current code supports cases where the same ModuleList has multiple repeating items, which is especially useful when the first/last layer of a block is different from the rest of them.
The better `__repr__` should make model prints smaller, more beautiful and significantly more useful by highlighting the difference between repeated blocks instead of losing it in a wall of text.
## Motivating real-life example.
You can try it out in this [colab notebook](https://colab.research.google.com/drive/1PscpX_K1UemIDotl2raC4QMy_pTqDq7p?usp=sharing).
The current `__repr__` output of gpt-j-6b is too big to add to this PR description:
```
GPTJModel(
(wte): Embedding(50400, 4096)
(drop): Dropout(p=0.0, inplace=False)
(h): ModuleList(
(0): GPTJBlock(
(ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
(attn): GPTJAttention(
(attn_dropout): Dropout(p=0.0, inplace=False)
(resid_dropout): Dropout(p=0.0, inplace=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(out_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): GPTJMLP(
(fc_in): Linear(in_features=4096, out_features=16384, bias=True)
(fc_out): Linear(in_features=16384, out_features=4096, bias=True)
(act): NewGELUActivation()
(dropout): Dropout(p=0.0, inplace=False)
)
)
(1): GPTJBlock(
(ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
(attn): GPTJAttention(
(attn_dropout): Dropout(p=0.0, inplace=False)
(resid_dropout): Dropout(p=0.0, inplace=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(out_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): GPTJMLP(
(fc_in): Linear(in_features=4096, out_features=16384, bias=True)
(fc_out): Linear(in_features=16384, out_features=4096, bias=True)
(act): NewGELUActivation()
(dropout): Dropout(p=0.0, inplace=False)
)
)
(2): GPTJBlock(
...
```
Better `__repr__` output looks like this:
```
GPTJModel(
(wte): Embedding(50400, 4096)
(drop): Dropout(p=0.0, inplace=False)
(h): ModuleList(
(0-27): 28 x GPTJBlock(
(ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
(attn): GPTJAttention(
(attn_dropout): Dropout(p=0.0, inplace=False)
(resid_dropout): Dropout(p=0.0, inplace=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(out_proj): Linear(in_features=4096, out_features=4096, bias=False)
)
(mlp): GPTJMLP(
(fc_in): Linear(in_features=4096, out_features=16384, bias=True)
(fc_out): Linear(in_features=16384, out_features=4096, bias=True)
(act): NewGELUActivation()
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
(ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90452
Approved by: https://github.com/albanD
This helps with kernels that make use of caching like mid-range softmax
which reads the data three times.
Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.
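For context, the eviction policy is chosen per `tl.load` in the generated code; a hand-written sketch of the same knob (the kernel below is made up for illustration):
```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # 'evict_first' marks these cache lines as not worth keeping, the policy
    # discussed above for the final softmax loop.
    x = tl.load(x_ptr + offs, mask=mask, eviction_policy="evict_first")
    tl.store(y_ptr + offs, x, mask=mask)
```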
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel
Found this issue from [weekly running 7k github models](https://github.com/pytorch/torchdynamo/issues/1884). It caused a regression in the pass rate; 25 models failed due to this issue.
The reason is that the `cx` argument of `aten._cudnn_rnn` can be `None`, but this case is not handled well in the meta registration, so it throws the following error:
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1059, in run_node
return nnmodule(*args, **kwargs)
File "/scratch/ybliang/work/repos/pytorch/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/ybliang/work/repos/pytorch/torch/nn/modules/rnn.py", line 477, in forward
result = _VF.rnn_tanh(input, hx, self._flat_weights, self.bias, self.num_layers,
File "/scratch/ybliang/work/repos/pytorch/torch/_subclasses/fake_tensor.py", line 916, in __torch_dispatch__
r = func(*args, **kwargs)
File "/scratch/ybliang/work/repos/pytorch/torch/_ops.py", line 284, in __call__
return self._op(*args, **kwargs or {})
File "/scratch/ybliang/work/repos/pytorch/torch/_meta_registrations.py", line 2108, in _cudnn_rnn
cy = cx.new_empty(0 if cx is None else cell_shape)
AttributeError: 'NoneType' object has no attribute 'new_empty'
```
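One way to guard against this looks roughly like the following. This is only an illustrative sketch: `_cell_output` and its arguments are hypothetical names, and the actual fix in the meta registration may differ.
```python
import torch

def _cell_output(cx, cell_shape, like):
    """Hypothetical helper: build `cy` without assuming `cx` is a tensor."""
    if cx is None:
        # Plain RNN/GRU have no cell state, so return an empty placeholder.
        return like.new_empty(0)
    return cx.new_empty(cell_shape)
```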
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91333
Approved by: https://github.com/ezyang
To help resolve issues like:
```
++ python3 .github/scripts/get_workflow_job_id.py 3736406815 i-08b8fd3e605729ed9
+ GHA_WORKFLOW_JOB_ID=
Warning: Attempt 2 failed. Reason: Child_process exited with error code 1
```
This should only happen when github actions is experiencing degraded service
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91145
Approved by: https://github.com/malfet
This PR adds a functionalization path for torch.cond. As this is a first pass, we only functionalize very restrictive use cases. We explicitly disallow the following:
- Output of each branch aliasing input
- In-place mutation on inputs given to each branch
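For context, a sketch of the kind of `cond` usage this covers, written against the `torch.cond` entry point exposed in newer releases (the import path has moved around over time). The branches respect the restrictions above: no aliasing of inputs and no in-place mutation:
```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4)
# Both branches are purely functional: they neither alias nor mutate `x`.
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```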
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89966
Approved by: https://github.com/zou3519
Builds up sympy expressions computing the lower and upper bound of ranges, and then finds `op.to_dtype(x, torch.int64)` nodes whose dominated values can all be computed in a lower precision. I haven't gotten all the way to work with dynamic shapes but it should be a fairly small change. There's still additional work to get torchinductor to work with large tensors (see https://github.com/pytorch/torchdynamo/issues/1819) because we would need to add explicit dtype annotations to int64 which we're not doing right now.
Fix for https://github.com/pytorch/torchdynamo/issues/1293.
Performance on OpBench for `aten.upsample_bilinear2d.vec` float32 (25th %, 50th %, 75th %):
- Before: [0.7521964636710751, 0.8645357996607477, 2.8746003906598494]
- After: [0.9511363478204263, 1.0295566597806718, 3.2662165264101755]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91028
Approved by: https://github.com/jansel
CUDA 12 introduces behavioral changes in `cudaSetDevice`. In the old version it would just set the device to be used for kernel launches and memory allocations without creating a CUDA context. Now, in CUDA 12, every time `cudaSetDevice` is called for the first time it creates a CUDA context. See issue #91122.
The autograd engine iterates over all devices and sets them:
f8b348c1fc/torch/csrc/autograd/engine.cpp (L1399-L1402)
f8b348c1fc/torch/csrc/autograd/engine.cpp (L349)
Which causes pollution of CUDA contexts on sibling devices.
This PR introduces a workaround this issue by conditionally setting the device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91191
Approved by: https://github.com/ngimel
It turns out that we *do* need to update *_scatter ops to return the exact same strides as their inputs. I added a test to `test/test_functionalization.py`, which now trips thanks to Ed's functionalization stride debugging check. It only actually ends up tripping silent correctness if you try to .backward() on that function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91029
Approved by: https://github.com/ezyang
- Use curly braces even after single line `if`
- Use whitespace between `if` and condition
- Use `c10::irange`
- Also, use `c10::multiply_integers` instead of explicit for loop over elements of `IntArrayRef`
- Do not pass `num_input_dims` to `set_apparent_shapes` as it is always equal to the length of `input_shape` array
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91221
Approved by: https://github.com/kit1980, https://github.com/huydhn
See #91122
Summary:
Some APIs are deprecated in newer version of CUDA.
* cudaGraphInstantiate:
From:
```
cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, cudaGraphNode_t* pErrorNode, char* pLogBuffer, size_t bufferSize )
```
To
```
__host__ cudaError_t cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, unsigned long long flags = 0 )
```
* cudaProfilerInitialize: deprecated in cuda 11 and removed in cuda 12
Test Plan: GH CI
Differential Revision: D41469051
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91050
Approved by: https://github.com/jianyuh
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps to eliminate the eager_variance failures seen on CI
2) Skip Triton failures instead of retrying
3) Some minor script cleanup is also included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91283
Approved by: https://github.com/anijain2305
This PR is a new version of #89566, fixing a test failure.
I couldn't get ghstack to cooperate on updating that PR after re-opening, so I started a new one.
This changes the way masks for loads/stores are computed in triton backend of inductor.
New approach is to iterate over all variables used in indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when variable is created.
I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.
Relative to #89566, the only change is to not include the mask variables
of arguments when the function being called is `tl.where`. The reason
being that `tl.where` is often used precisely to make sure the output
variable has valid values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel
Use Prims to implement group_norm, group_norm_backward and mean_var.
Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in order to make them importable from `torch/backends/mps/__init__.py`, as the `torch.ops` alias is defined in 15af4b1cee/torch/__init__.py (L1095), which is executed last during the init process.
Add `__all__` to `torch/backends/mps/__init__.py`, as well as alias all imports as private.
Add `TestNNMPS.test_group_norm_backward`, which validates that no NaNs are generated during the backward pass.
Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
This change aims to make the bazel build more embedding-friendly.
Namely, when PyTorch is included as an external repo in another project, it is usually included like this
```
native.local_repository(
name = "pytorch",
path = ...,
repo_mapping = repo_mapping,
)
```
Or
```
http_archive(
name = "pytorch",
urls = ...
repo_mapping = repo_mapping,
)
```
In this case, references to `@//` would resolve to the top-level WORKSPACE that includes PyTorch.
That makes upgrades harder because we need to carry around this patch.
Note that under some edge-case circumstances even `//` resolves to the top-level `WORKSPACE`.
This change makes the embedding of the bazel build easier without compromising anything for the main repo, since the `@pytorch//` still refers to the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89660
Approved by: https://github.com/kit1980
**What does this PR do?**
This PR refactors the FSDP optimizer state_dict APIs to accept `NamedOptimizer` as the input optimizer. The key difference is that the state_dict returned by `NamedOptimizer` is already keyed by FQN. This PR mainly changes the internal mapping to allow the optimizer state_dict to be keyed by FQN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91160
Approved by: https://github.com/fduwjj, https://github.com/rohan-varma
Fixes#90500
The change here checks for parameter changes at the beginning of each `forward()` call; if the parameters are found to be different tensors than last time, `self._flat_weights` is re-initialized with the new values. Thus, swapping parameter values using `stateless.functional_call()` will re-initialize `self._flat_weights` during the `forward()` call, and the provided parameters will be used for module computation as expected.
NB: There are still some changes needed for symbolic shapes to work with `nn.GRU` (will address in a follow-up PR).
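To illustrate the scenario being fixed (a sketch, not the PR's test):
```python
import torch
from torch.nn.utils import stateless

gru = torch.nn.GRU(input_size=4, hidden_size=8, batch_first=True)
swapped = {name: torch.randn_like(p) for name, p in gru.named_parameters()}
x = torch.randn(2, 5, 4)
# With this change, the swapped-in parameters are what the forward pass
# actually uses, because self._flat_weights is rebuilt from them.
out, h = stateless.functional_call(gru, swapped, (x,))
```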
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91111
Approved by: https://github.com/ezyang, https://github.com/albanD
Summary:
After this change, if the querypool_flushed_shader_log test fails:
1) The test continues after the first failure and checks all three (because ASSERT was changed to EXPECT)
2) The op names which are compared to vulkan.add, vulkan.sub, and vulkan.mul are shown, rather than not showing what the wrong op name was (because we use ..._EQ(a, b) instead of just checking ...(a == b))
This change makes it easier to debug future failures to querypool_flushed_shader_log (it helped me when one of my diffs broke the test)
Test Plan:
Vulkan API Test
- https://www.internalfb.com/intern/aibench/details/959371570734292
Reviewed By: SS-JIA
Differential Revision: D42186371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91259
Approved by: https://github.com/SS-JIA
**What does this PR do?**
This PR splits the FSDP optim_state_dict APIs into common implementation parts that are shared across the different frontend APIs (we have many now and will consolidate them gradually). This PR also adds `_optim_state_dict_post_hook` and `_load_optim_state_dict_pre_hook` for the integration with `NamedOptimizer`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90798
Approved by: https://github.com/rohan-varma, https://github.com/awgu
This reverts commit d6fc2d82ca616f87d9fef49e84e6d4ff6976292f.
Reverted https://github.com/pytorch/pytorch/pull/91018 on behalf of https://github.com/kit1980 due to After this PR, inductor / cuda11.6-py3.10-gcc7-sm86 / test fails every time with CUDA out of memory during OPTForCausalLM
I realized test_fused_optimizers used a helper that was written for foreach, so we were not testing fused at all. This PR fixes that test so we actually test fused adam.
Explicitly adding fused=False sets the stage for my later changes (but should be a no-op here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91228
Approved by: https://github.com/albanD, https://github.com/soulitzer
Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.
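For reference, current PyTorch exposes the toggle as shown below; this is assumed to be the option introduced here, so check your version's docs:
```python
import torch

# Disallow reduced-precision reductions (e.g. low-precision accumulation in
# split-K style kernels) for BFloat16 GEMMs.
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
```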
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172
Approved by: https://github.com/ngimel
Allow _apply_optim_in_backward to work with DDP.
Example:
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# Note: the exact import location of _apply_optimizer_in_backward may differ
# across versions; it lives under torch.distributed.optim.
from torch.distributed.optim import _apply_optimizer_in_backward

# `rank` comes from the process launcher; `enc` is a user-defined encoder module.
dist.init_process_group("nccl", rank=rank, world_size=2)
torch.cuda.set_device(rank)
e = enc().cuda(rank)
_apply_optimizer_in_backward(
    optimizer_class=torch.optim.SGD,
    params=e.parameters(),
    optimizer_kwargs={"lr": 0.03},
)
e = DDP(e, device_ids=[rank])
inp = torch.randn(1, 10, device=rank)
e(inp).sum().backward()
```
Constraints:
1. Custom communication hook not yet supported
2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP.
3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used.
4. All DDP managed parameters have grads set to None by default once optimizer is applied. There is no support for setting only some parameter grads to None, this must be done manually by user (and DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0 needs to be set.)
Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194
Approved by: https://github.com/zhaojuanmao
The `multiplicationWithPrimaryTensor` and/or `scatterWithDataTensor` APIs have issues handling two f16 tensor inputs, resulting in all-zero outputs. With int16 or int64 inputs, there are issues as well.
This PR conditionally casts inputs to f32 if they're not and then casts the output back to the source's datatype.
Fixes#82645.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88542
Approved by: https://github.com/kulinseth
I'm going to need this in the follow up PR. Instead of storing only Source.name() in Symbol, I now store a full on Source. Lots of replumbing reoccurs. In particular:
- Move Source to torch._guards to break cycles
- I have to add TensorPropertySource and NegateSource to handle x.size()[0] and -x codegen that I was doing with string manipulation previously
- I tighten up invariants so that I never pass source=None; instead I pass ConstantSource (these are constant sources right) and test for that rather than source being missing. I think this is more parsimonious
- Some mypy wobbles from new imports
I didn't move LocalSource and friends to torch._guards, but I ended up needing to access them in a few places. The main annoyance with moving these is that then I also need to move the bytecode codegen stuff, and that's not so easy to move without bringing in the kitchen sink.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91057
Approved by: https://github.com/albanD, https://github.com/voznesenskym
The idea is to make ShapeEnv guards less of a one-off special snowflake, and integrate it more closely with the regular builder infrastructure. But it is not so easy: the shape env code has to live after tensor match code, because we need to know that the values in question are tensors before we start matching on them. So we introduce a new `shape_env_code` field to put the special shape env code, so we can add it to the final constructed code after tensor.
Everything else works the obvious way. There's a new ShapeEnvSource for constructing the singleton SHAPE_ENV guard that drives the shape env guard construction. I added some more docs and also made the printed code for guards include the enclosing lambda for more clarity.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91055
Approved by: https://github.com/albanD, https://github.com/voznesenskym
## Job
Test running on most CI jobs.
## Test binary
* `test_main.cpp`: entry for gtest
* `test_operator_registration.cpp`: test cases for gtest
## Helper sources
* `operator_registry.h/cpp`: a simple operator registry for testing purposes.
* `Evalue.h`: a boxed data type that wraps ATen types, for testing purposes.
* `selected_operators.yaml`: operators Executorch care about so far, we should cover all of them.
## Templates
* `NativeFunctions.h`: for generating headers for native functions. (not compiled in the test, since we will be using `libtorch`)
* `RegisterCodegenUnboxedKernels.cpp`: for registering boxed operators.
* `Functions.h`: for declaring operator C++ APIs. Generated `Functions.h` merely wraps `ATen/Functions.h`.
## Build files
* `CMakeLists.txt`: generate code to register ops.
* `build.sh`: driver file, to be called by CI job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89596
Approved by: https://github.com/ezyang
Design document:
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit
This PR adds a `generate_vmap_rule` option (default False) to autograd.Function.
By setting it to True, a user promises to us that their autograd.Function's
{forward, backward, jvp}, if defined, only uses PyTorch operations, in addition to the other
limitations of autograd.Function+functorch (such as the user not
capturing any Tensors being transformed over from outside of the
autograd.Function).
Concretely, the approach is:
- we update `custom_function_call` to accept an additional
`generate_vmap_rule` argument.
- The vmap rule for `custom_function_call` and `generate_vmap_rule=True`
is: we construct a vmapped version of the autograd.Function and dispatch
on it.
- The vmapped version of the autograd.Function can be thought of like
the following: if we have an autograd.Function Foo, then
VmappedFoo.apply(in_dims, ...) has the same semantics as
vmap(Foo.apply, in_dims...)
- VmappedFoo's forward, setup_context, and backward staticmethod are
vmapped versions of Foo's staticmethods.
- See the design doc for more motivation and explanation
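Concretely, opting in looks roughly like this (a minimal sketch under the constraints above, not taken from the added tests):
```python
import torch

class MySquare(torch.autograd.Function):
    generate_vmap_rule = True  # promise: staticmethods only use torch ops

    @staticmethod
    def forward(x):
        return x ** 2

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.randn(5, 3)
y = torch.vmap(MySquare.apply)(x)  # the vmap rule is generated automatically
```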
Test Plan:
- This PR introduces additional autograd.Function with the suffix "GenVmap" to
autograd_function_db.
- There are also some minor UX tests
Future:
- jvp support
- likely more testing to come, but please let me know if you have
cases that you want me to test here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90966
Approved by: https://github.com/soulitzer
As seen in
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit
`reductify_leaf(grad_input, ...)` is a helper function that processes a
single grad_input Tensor. The reason why we need it is:
- the grad_input has some optional bdim
- the input has some optional bdim
- if these are different, we need to coerce the grad_input into having
the same shape as the input, either by reducing or expanding the
grad_input.
Note that there is a special case in autograd that the user is allowed
to return a grad_input Tensor that is an expanded version of the
original input tensor. In this case, autograd automatically reduces
grad_input to the same shape as the input. Unfortunately this logic
doesn't work when bdims are involved, so we manually handle it in
`reductify_leaf`.
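To make the coercion concrete, here is an illustrative-only sketch of the cases (the function name, arguments, and exact reductions are assumptions, not functorch's actual implementation):
```python
import torch

def reductify_leaf_sketch(grad_input, grad_input_bdim, input_bdim, batch_size):
    # Illustrative only; not functorch's reductify_leaf. Assumes non-negative bdims.
    if grad_input is None:
        return None
    if grad_input_bdim is None and input_bdim is None:
        return grad_input
    if grad_input_bdim is None and input_bdim is not None:
        # grad_input has no batch dim but the input did: expand one in.
        return grad_input.unsqueeze(input_bdim).expand(
            *grad_input.shape[:input_bdim], batch_size, *grad_input.shape[input_bdim:]
        )
    if grad_input_bdim is not None and input_bdim is None:
        # grad_input has a batch dim but the input did not: reduce (sum) it out.
        return grad_input.sum(grad_input_bdim)
    # Both have a batch dim: move grad_input's to match the input's.
    return grad_input.movedim(grad_input_bdim, input_bdim)
```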
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90965
Approved by: https://github.com/soulitzer
As seen in
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit
`restore_vmap` is a private helper function. It is vmap but has the
following
differences:
- instead of returning outputs, it returns an (outputs, out_dims) tuple.
out_dims is a pytree of the same shape as outputs and contains Optional[int]
specifying where the vmapped dimension, if it exists, is in the
corresponding output.
- does no validation on in_dims or inputs (vmap expects at least one
Tensor to be vmapped).
restore_vmap allows for no inputs to have the vmap dimension
- does no validation on outputs (vmap expects only Tensor outputs)
restore_vmap allows for return of arbitrary outputs (not just
Tensors)
Test Plan:
- added some simple test to test restore_vmap
- I am OK with restore_vmap not being a part of vmap right now -- the
implementation of vmap rarely changes and it is a bit difficult to
refactor vmap in a way that restore_vmap is a subroutine.
Other questions:
- Bikeshedding the `restore_vmap` name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90963
Approved by: https://github.com/samdow, https://github.com/soulitzer
Otherwise, Nested Tensor kernels won't sync with current stream, resulting in flaky unit tests in test_nestedtensor.py.
This is the second time the wrong streams have been used in NestedTensor code. See #84134 for another example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91180
Approved by: https://github.com/mikaylagawarecki
1. Add param_group check logic and unit test
2. Remove unnecessary check for conditional param update
3. Return the param_group from the inner optimizer so that when param_group is None or not all params are specified, we still return the expected result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91147
Approved by: https://github.com/fegin
Use ROCR_VISIBLE_DEVICES to limit GPU visibility, in preparation for CI node upgrade to ROCm5.3 KFD and UB22.04.
### PROBLEM
After upgrading some of our CI nodes to UB22.04 and ROCm5.3KFD, rocminfo doesn't work inside the docker container if we use the following flags: `--device=/dev/dri/renderD128 --device=/dev/dri/renderD129`. It gives the error:
```
+ rocminfo
ROCk module is loaded
Failed to set mem policy for GPU [0x6b0d]
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1140
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
```
### WORKAROUND
Use `--device=/dev/dri` instead, and use `ROCR_VISIBLE_DEVICES` to limit GPU visibility inside container.
### BACKGROUND OF ORIGINAL CODE
We introduced these flags to prepare for 2 runners per CI node, to split up the GPU visibility among the runners: https://github.com/pytorch/pytorch/blame/master/.github/actions/setup-rocm/action.yml#L58
That effort - 2 runners per CI node - is still pending, and we might need to revisit this patch when we try to enable that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91031
Approved by: https://github.com/jeffdaily, https://github.com/malfet
This is to address the recent flakiness issue on MacOS ARM64 https://hud.pytorch.org/failure/Library%20not%20loaded%3A%20%40rpath%2Flibzstd.1.dylib.
From what I see, the immediate cause is that the `cmake` executable under `/Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/` is used instead of the expected one under the temp CONDA_ENV, i.e. `/Users/ec2-user/runner/_work/_temp/conda_environment_3736476178/bin`. I'm not quite sure what the reason behind this flaky behavior is, so I want to try a catch-all fix by setting the cmake PATH correctly.
This PR also prints some debugging information w.r.t cmake PATH, and cleans up some legacy code in `macos-test.sh` script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91142
Approved by: https://github.com/ZainRizvi
This PR supports nesting `replicate` in `fully_shard`.
- The PR achieves this by treating `replicate`-annotated modules as ignored modules. This means that all submodules in the `replicate`-annotated module's subtree are ignored, including nested `fully_shard`-annotated modules, which is the desired behavior.
---
This PR reworks some tree traversal.
One end goal is for `state._handles` to follow the same order for both the wrapper and composable paths. This implies that `_get_fsdp_handles()` returns the same value for both paths.
- The helper function `_get_fully_sharded_module_to_states()` now follows a left-to-right DFS from each fully sharded module instead of a BFS. The left-to-right DFS follows `.modules()` order.
- The composable auto "wrap" initialization function `_init_param_handles_from_module()` follows the reverse left-to-right DFS order. As noted in the code comments, this initialization order is a valid reverse topological sort, but it differs from the wrapper path. This is the _only_ difference with respect to initialization order through the entire process.
```
mod: Module(
submod1: Submodule()
submod2: Submodule(
subsubmod: Subsubmodule(),
),
)
```
For left-to-right DFS, the order is `mod`, `submod1`, `submod2`, `subsubmod`. (For context, right-to-left DFS would be `mod`, `submod2`, `subsubmod`, `submod1`. In other words, the left-to-right vs. right-to-left corresponds to `.children()` vs. `reversed(.children())` respectively.) Then, reverse left-to-right DFS is `subsubmod`, `submod2`, `submod1`, `mod`, which is a valid initialization order. However, the wrapper auto wrap initialization order would be `submod1`, `subsubmod`, `submod2`, `mod` since it directly follows a left-to-right DFS and initializes as a part of the recursive DFS logic.
- At the end of `_init_param_handles_from_module()`, we reverse the newly populated `state._handles`, so this is the reverse reverse left-to-right DFS order, which is equivalent to the left-to-right DFS order. Thus, `state._handles` has the same order for both paths.
Another goal is for `_get_fsdp_states()` to not traverse into any submodule that is annotated with an API that is not compatible with `fully_shard` (e.g. `replicate`). To achieve this while preserving that `_get_fsdp_states()` follows `.modules()` order, we again use a left-to-right DFS.
The DFSs may look strange because I implemented them non-recursively, which requires a stack.
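For reference, a minimal sketch of this kind of stack-based, left-to-right DFS (illustrative only, not the actual FSDP helper):
```python
import torch.nn as nn

def left_to_right_dfs(root: nn.Module):
    # Iterative preorder DFS whose visit order matches ``root.modules()``.
    stack = [root]
    order = []
    while stack:
        module = stack.pop()
        order.append(module)
        # Push children in reverse so the leftmost child is popped (visited) first.
        stack.extend(reversed(list(module.children())))
    return order
```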
- `test_get_fully_sharded_module_to_states()` in `test_utils.py` checks the traversal order of `_get_fully_sharded_module_to_states()`.
- `test_policy()` in `test_fully_shard.py` checks the traversal order returned by `_get_fsdp_handles()`.
---
Due to a circular dependency issue, we must move the graph/tree traversal helpers to their own file `_traversal_utils.py`, and any usages must import the entire file like `import torch.distributed.fsdp._traversal_utils as traversal_utils` instead of `from torch.distributed.fsdp._traversal_utils import ...`.
The cycle comes from the fact that the traversals require `_composable()`, which requires `_get_registry()` from `composable/contract.py`, which when imported, imports `composable/fully_shard.py`, which requires the traversals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91044
Approved by: https://github.com/mrshenli
This adds a note to explain how to do traversal in the new code base. These traversal helper methods were introduced in [1/N], [3/N], and [5/N].
I am working on updating the traversal helpers to account for other composable APIs (e.g. `replicate`). The rule is that the traversal should not proceed into an incompatible API's tree. This will be needed for `fully_shard` to be above `replicate`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90959
Approved by: https://github.com/mrshenli
This PR adds manual "wrapping" support for `fully_shard`. For example, for
```
fully_shard(mod.sub)
fully_shard(mod)
```
`mod.sub` and `mod` will share the same FSDP data structures.
To have parity with wrapper FSDP, this PR only checks support for the case where each manual application of `fully_shard` passes `policy=None`. Hybrid auto / manual wrapping is not in scope for this PR since it is not supported for wrapper FSDP either. I can follow up to either add support properly or raise an error early.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90874
Approved by: https://github.com/mrshenli
For `limit_all_gathers`, if we do not enforce that they all have the same value, then the entire semantics guaranteed by the `bool` can be violated. It could be as if none of them set that value to be `True`.
For `use_orig_params`, optimizer state dict assumes that the value is the same for all FSDP instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90871
Approved by: https://github.com/mrshenli
This makes it possible to know, at any point during the backward pass, what is running and where the currently running Node was created:
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.autograd import detect_anomaly
class MyMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args, kwargs=None):
        node = torch._C._current_autograd_node()
        print(f"Running {func} from within {node}")
        if node is not None:
            print("The Node was created at:")
            print("\n ".join(node.metadata["traceback_"]))
        return func(*args, **kwargs or {})

with MyMode(), detect_anomaly():
    print("FW")
    a = torch.rand(10, requires_grad=True)
    b = a.mul(2)
    b = b.div(3)
    b = b.sum()
    print("BW")
    b.backward()
```
Gives
```
$ python foo.py
foo.py:15: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with MyMode(), detect_anomaly():
FW
Running aten.rand.default from within None
Running aten.mul.Tensor from within None
Running aten.div.Tensor from within None
Running aten.sum.default from within None
BW
Running aten.ones_like.default from within None
Running aten.expand.default from within <SumBackward0 object at 0x7fa40c0c6dc0>
The Node was created at:
File "foo.py", line 20, in <module>
b = b.sum()
Running aten.isnan.default from within <SumBackward0 object at 0x7fa40c0c6500>
The Node was created at:
File "foo.py", line 20, in <module>
b = b.sum()
Running aten.any.default from within <SumBackward0 object at 0x7fa32b23a780>
The Node was created at:
File "foo.py", line 20, in <module>
b = b.sum()
Running aten._local_scalar_dense.default from within <SumBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 20, in <module>
b = b.sum()
Running aten.div.Tensor from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 19, in <module>
b = b.div(3)
Running aten.isnan.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 19, in <module>
b = b.div(3)
Running aten.any.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 19, in <module>
b = b.div(3)
Running aten._local_scalar_dense.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 19, in <module>
b = b.div(3)
Running aten.mul.Tensor from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
Running aten.isnan.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
Running aten.any.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
Running aten._local_scalar_dense.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
Running aten.detach.default from within <AccumulateGrad object at 0x7fa40c0c9730>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
Running aten.detach.default from within <AccumulateGrad object at 0x7fa40c0c94b0>
The Node was created at:
File "foo.py", line 18, in <module>
b = a.mul(2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90867
Approved by: https://github.com/soulitzer
### Motivation
When dim is -1 and the slice of the source or result is noncontiguous, the original `index_add` is slow, as it uses add for the sliced tensor, which is serial over the index and parallel over the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal, since the sliced tensor may not be big enough to parallelize, and it also incurs multiple parallel regions.
`scatter_add` is used to speed up this case, as `scatter_add` parallelizes over the outer dimension of the input and is serial on the inner dimension to avoid write conflicts. `scatter_add` only needs one parallel region, and the outer dimensions are larger, making parallelization worthwhile.
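As a small self-contained illustration of the equivalence being exploited (the shapes and values here are made up, not from the PR's benchmarks):
```python
import torch

# index_add along the last dim can be expressed via scatter_add by expanding
# the 1-D index to the shape of the source.
x = torch.zeros(4, 8)
src = torch.randn(4, 3)
index = torch.tensor([1, 4, 4])

out_index_add = x.clone().index_add_(-1, index, src)

expanded_index = index.unsqueeze(0).expand_as(src).contiguous()
out_scatter_add = x.clone().scatter_add_(-1, expanded_index, src)

assert torch.allclose(out_index_add, out_scatter_add)
```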
### Testing
- Single core:
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003
- Single socket (28 cores):
Before:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568
After:
shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
Fix the failure when building PyTorch from source code using CUDA 12
```
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAFunctions.h:12,
from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:10,
from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp: In member function ‘void at::cuda::CUDAGraph::capture_end()’:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: warning: converting to non-pointer type ‘long long unsigned int’ from NULL [-Wconversion-null]
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
C10_UNUSED const cudaError_t __err = EXPR; \
^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
^~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: error: too many arguments to function ‘cudaError_t cudaGraphInstantiate(CUgraphExec_st**, cudaGraph_t, long long unsigned int)’
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
C10_UNUSED const cudaError_t __err = EXPR; \
^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
^~~~~~~~~~~~~
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:6,
from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/usr/local/cuda/include/cuda_runtime_api.h:11439:39: note: declared here
extern __host__ cudaError_t CUDARTAPI cudaGraphInstantiate(cudaGraphExec_t *pGraphExec, cudaGraph_t graph, unsigned long long flags __dv(0));
^~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
```
```
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp: In function ‘void torch::cuda::shared::initCudartBindings(PyObject*)’:
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: error: ‘cudaOutputMode_t’ was not declared in this scope
py::enum_<cudaOutputMode_t>(
^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: note: suggested alternative: ‘cudaGraphNode_t’
py::enum_<cudaOutputMode_t>(
^~~~~~~~~~~~~~~~
cudaGraphNode_t
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:29: error: template argument 1 is invalid
py::enum_<cudaOutputMode_t>(
^
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:38:30: error: ‘cudaKeyValuePair’ was not declared in this scope
.value("KeyValuePair", cudaKeyValuePair)
^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: error: ‘cudaCSV’ was not declared in this scope
.value("CSV", cudaCSV);
^~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: note: suggested alternative: ‘cudart’
.value("CSV", cudaCSV);
^~~~~~~
cudart
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: error: ‘cudaProfilerInitialize’ was not declared in this scope
cudaProfilerInitialize);
^~~~~~~~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: note: suggested alternative: ‘cudaProfilerStart’
cudaProfilerInitialize);
^~~~~~~~~~~~~~~~~~~~~~
cudaProfilerStart
ninja: build stopped: subcommand failed.
```
After these fixes, we can see CUDA 12 is successfully built with OSS PyTorch instructions.
USE_CUDA=1 python setup.py develop 2>&1 | tee compile.log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91118
Approved by: https://github.com/ngimel, https://github.com/brad-mengchi
Ensures that load_state_dict for fully_shard works:
- Don't add back FSDP prefix
- Small fix to ensure mixed precision check for buffers work
Follow ups:
- state_dict_type does not work, blocking rank0_only and CPU offload as well as other state dict implementations
- No testing when wrapped with AC, using mixed precision, integration with distributed checkpoint, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90945
Approved by: https://github.com/awgu
Summary: As the FX passes for permute fusion run before functionalization, it might be safer to replace `graph.eliminate_dead_code()` with `graph.erase_node()` to avoid cases where `graph.eliminate_dead_code()` might remove mutation nodes.
Test Plan: Unit Tests & CI
Reviewed By: jansel
Differential Revision: D41904755
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91014
Approved by: https://github.com/jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880
# Summary
Enables multiple step trackers. Previously we only had one place to mark that a step() has occurred in the program: the PyTorch profiler's step() call.
We are now working on adding an Optimizer step hook - https://github.com/pytorch/pytorch/issues/88446
- This could mean programs that already call profiler.step() every iteration can end up double incrementing steps
- If a model uses multiple optimizers we can also have double or more counting of the step.
## Solution
We fix this by adding a layer of abstraction before calling step() to the kineto library. The idea is to maintain steps per requester in a dictionary
```
{
"ProfilerStep": 100, # triggered by profiler step() call
"Optimizer1Step": 100, # Optimizer 1 or 2 are just examples, could be SGD, Adam etc
"Optimizer2Step": 100,
}
```
To figure out the global step count just take max on the dict values (100).
```
{
"ProfilerStep": 100,
"Optimizer1Step": 101, # Optimizer1 got incremented first say
"Optimizer2Step": 100,
}
```
Then global step count is 101
## Calling kineto
We only call the kineto step() function when global count increments.
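A minimal sketch of this bookkeeping (the names are illustrative, not the actual profiler/kineto internals):
```python
class StepTrackerSketch:
    """Tracks per-requester step counts; the global step is their max."""

    def __init__(self):
        self._steps = {}          # requester name -> local step count
        self._global_step = 0

    def increment(self, requester: str) -> bool:
        """Bump one requester's count; return True if the global step advanced."""
        self._steps[requester] = self._steps.get(requester, 0) + 1
        new_global = max(self._steps.values())
        advanced = new_global > self._global_step
        self._global_step = new_global
        return advanced  # only forward a step() to kineto when this is True
```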
# Test Plan:
Added a unit test
buck2 run mode/dev-nosan caffe2/test:profiler
Differential Revision: D41751157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880
Approved by: https://github.com/chaekit
This PR sets up torch.func and populates it with the following APIs:
- grad
- grad_and_value
- vjp
- jvp
- jacrev
- jacfwd
- hessian
- functionalize
- vmap
It also renames all instances of `functorch` to `torch.func` in the docs for those APIs.
We rewrite the `__module__` fields on some of the above APIs so that the
APIs fit PyTorch's public api definition.
- For an API to be public, it must have a `__module__` that points to a
public PyTorch submodule. However, `torch._functorch.eager_transforms`
is not public due to the leading underscore.
- The solution is to rewrite `__module__` to point to where the API is
exposed (torch.func). This is what both Numpy and JAX do for their
APIs.
- h/t pmeier in
https://github.com/pytorch/pytorch/issues/90284#issuecomment-1348595246
for idea and code
- The helper function, `exposed_in`, is confined to
torch._functorch/utils for now because we're not completely sure if
this should be the long-term solution.
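To illustrate the `__module__` rewrite from the last bullet, a rough sketch (the decorator body and the example signature are assumptions about how such a helper could look, not the actual torch._functorch code):
```python
def exposed_in(module_name):
    """Rewrite a function's __module__ so it counts as part of the public API."""
    def decorator(fn):
        fn.__module__ = module_name
        return fn
    return decorator

# Hypothetical usage: an API defined in a private submodule but exposed publicly.
@exposed_in("torch.func")
def grad(func, argnums=0, has_aux=False):
    ...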
Implication for functorch.* APIs:
- functorch.grad is the same object as torch.func.grad
- this means that the functorch.grad docstring is actually the
torch.func.grad docstring and will refer to torch.func instead of
functorch.
- This isn't really a problem since the plan on record is to deprecate
functorch in favor of torch.func. We can fix these if we really want,
but I'm not sure if a solution is worth maintaining.
Test Plan:
- view docs preview
Future:
- vmap should actually just be torch.vmap. This requires an extra step
where I need to test internal callsites, so, I'm separating it into a
different PR.
- make_fx should be in torch.func to be consistent with `import
functorch`. This one is a bit more of a headache to deal with w.r.t.
public api, so going to deal with it separately.
- beef up func.rst with everything else currently on the functorch
documention website. func.rst is currently just an empty shell.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91016
Approved by: https://github.com/samdow
I count the number of sub-graphs (for tiny-GPT2 in huggingface) by
```
class GraphCaptureCompiler:
    def __init__(self):
        self.captured_graphs = []

    def compile(self, gm, example_inputs):
        self.captured_graphs.append(gm)
        return gm

compiler = GraphCaptureCompiler()
torch._dynamo.optimize(compiler, nopython=True)(Wrapper(fn))(*args)
```
Although `len(compiler.captured_graphs)` is 2, no error was thrown during the compilation. This observation conflicts with `nopython=True`. After some digging, I found that a check was missing before making a graph break. This PR adds it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90970
Approved by: https://github.com/ezyang, https://github.com/jansel
Previously we would abort() but this is annoying when you're running
pytest or something. Don't hard crash.
It would be nice to apply this treatment to the other uses of CHECK
macro in this file, but it was just guards that was bothering me.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91053
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/torchdynamo/issues/1995
Running `python benchmarks/dynamo/timm_models.py --performance --float32 -dcuda --output=out.csv --training --inductor --only bad_model_name` gives
```
Traceback (most recent call last):
File "benchmarks/dynamo/timm_models.py", line 338, in <module>
main(TimmRunnner())
File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 1660, in main
return maybe_fresh_cache(run, args.cold_start_latency and args.only)(
File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 833, in inner
return fn(*args, **kwargs)
File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 2000, in run
) = runner.load_model(device, model_name, batch_size=batch_size)
File "benchmarks/dynamo/timm_models.py", line 215, in load_model
raise RuntimeError(f"Failed to load model '{model_name}'")
RuntimeError: Failed to load model 'bad_model_name'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91049
Approved by: https://github.com/ezyang
## Logic to handle custom ops
We generate files for custom ops, so that they can be registered into PyTorch.
Generated files:
* `Register{dispatch_key}CustomOps.cpp` (dispatch_key = CPU), it's basically the same as vanilla PyTorch `RegisterCPU.cpp`. The only difference is that we bind to native functions directly.
* `Register{dispatch_key}Stub.cpp` (dispatch_key = CPU), register placeholder kernels for custom ops. Only used when there's no custom op kernel available.
As an example:
```cpp
namespace {
at::Tensor & wrapper_out_unsqueeze_out(const at::Tensor & self, int64_t dim, at::Tensor & out) {
    // No device check
    // DeviceGuard omitted
    return torch::executor::native::unsqueeze_out(self, dim, out);
}
} // anonymous namespace

TORCH_LIBRARY_IMPL(aten, CPU, m) {
    m.impl("unsqueeze.out",
           TORCH_FN(wrapper_out_unsqueeze_out));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90099
Approved by: https://github.com/ezyang
This PR adds `unboxing.py`, which converts an `EValue` (similar to an `IValue`) to its corresponding C++ type, based on the `ExecutorchCppSignature`.
Added unit tests to it in `test_executorch_unboxing.py`. Notice that this unboxing logic should work for both ATen types and Executorch types, hence the unit tests are parametrized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90098
Approved by: https://github.com/ezyang
@bypass-github-export-checks
This change ensures that vulkan event start/end times are correctly synced with their parent CPU times.
This sometimes requires increasing CPU event durations (to fully contain their child events) and delaying CPU event start times (to prevent overlaps), so this should not be used unless Vulkan events are being profiled and it is ok to use this modified timestamp/duration information instead of the original information.
Differential Revision: [D39893109](https://our.internmc.facebook.com/intern/diff/D39893109/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39893109/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90672
Approved by: https://github.com/kimishpatel
@bypass-github-export-checks
This change ensures that parent/child relationships between vulkan events and their corresponding CPU events are established correctly. (Previously, if a vulkan event's duration was too long, it would not be made a child correctly).
This could be merged in with the preceding diff, but I wanted to separate it for now because I'm not sure what the most appropriate way is to pass through the events and adjust the in_tree_building_ flag (the way I have it now seems a bit awkward), so keeping it separate for now makes it easier to understand/fix. Taylor, if you have feedback on this, let me know.
Differential Revision: [D40084788](https://our.internmc.facebook.com/intern/diff/D40084788/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90671
Approved by: https://github.com/kimishpatel
@bypass-github-export-checks
This diff enables passing processing events in the profiler. Passing the events from QueryPool, and making sure vulkan events align with parent CPU events correctly will be handled later in this diff stack.
This diff was made by forking Taylor's scaffolding diff, D39779878, with a few changes:
- Rebasing + resolving merge conflicts
- Fixing (i.e. removing) auto import of profiler/containers.h
- Changing the activity type to CPU_OP which makes the vulkan events appear on chrometrace
- Moving timestamp adjustment scaffolding to D39893109
Differential Revision: [D39834805](https://our.internmc.facebook.com/intern/diff/D39834805/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90852
Approved by: https://github.com/mcr229
@bypass-github-export-checks
We want to avoid tossing shader log entries when we reset the query pool, so that the old entries can be used by the profiler after all profiling data has been gathered.
```get_shader_name_and_execution_duration_ns``` is used for accessing shader names/durations after they are flushed. It will be used with the torch profiler.
Differential Revision: [D40119621](https://our.internmc.facebook.com/intern/diff/D40119621/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90668
Approved by: https://github.com/kimishpatel
1. No need to move inputs/activations to devices for every nested FSDP instance
2. It also breaks the case where some nested FSDP instances have newly added inputs/activations in the signatures of the submodules they wrap; args_tuple[0] and kargs_tuple[0] are not the correct way to get the inputs/activations for these nested instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91078
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
Motivations for this change:
1. TorchRec returns inconsistent results on `m.named_parameters()`
and `m.m1.named_parameters()` if m1 is a `ShardedModule`. Basically,
`ShardedModule` appears in `m.named_modules()`, but its parameters
are not in `m.named_parameters()`. As a result, when we identify
`ShardedModule` and pass them as `ignored_modules` to FSDP, FSDP
complains about key error in `_get_ignored_params`.
2. If users are manually wrapping submodules with FSDP, it could be
easier for them to keep a global set of ignored parameters, instead
of creating a new collection for every FSDP invocation.
Given the above two reasons, we allow FSDP to have ignored modules
out of the wrapped root module.
Differential Revision: [D42132394](https://our.internmc.facebook.com/intern/diff/D42132394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91079
Approved by: https://github.com/awgu
**Summary**
This PR adds a fused `QLinearTanh` module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.
**Test plan**
python test_quantization.py TestStaticQuantizedModule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88923
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Fixes#88652
In the CPU implementation of linspace for integral types, the `base` type in the vectorized implementation is `int64_t`, which drops precision when `base` comes from a floating-point number. Meanwhile, the vectorized implementation tends to suffer from catastrophic cancellation of floating-point arithmetic, since both the `base` (`start + step * idx`) and the `step` are not exact. The scalar implementation is fine, since `start` is always an integer and the result is truncated to an integer as well.
Therefore, in this PR, we skip the vectorized implementation, since vectorization doesn't contribute to performance here anyway. Now the behaviors on CPU and GPU are the same. In some cases, the results match numpy's; in other cases they differ from numpy's, but the difference is not related to the device (CPU vs. GPU). https://github.com/pytorch/pytorch/issues/81996#issuecomment-1192980485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89048
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/albanD
This PR adds `@comptime`, a decorator that causes a given function to be executed at compile time when Dynamo is symbolically evaluating their program. To query the Dynamo state, we offer a public ComptimeContext API which provides a limited set of APIs for querying Dynamo's internal state. We intend for users to use this API and plan to keep it stable. Here are some things you can do with it:
* You want to breakpoint Dynamo compilation when it starts processing a particular line of user code: give comptime a function that calls breakpoint
* You want to manually induce a graph break for testing purposes; give comptime a function that calls unimplemented
* You want to perform a debug print, but you don't want to induce a graph break; give comptime a function that prints.
* You can print what the symbolic locals at a given point in time are.
* You can print out the partial graph the Dynamo had traced at this point.
* (My original motivating use case.) You want to add some facts to the shape env, so that a guard evaluation on an unbacked SymInt doesn't error with data-dependent. Even if you don't know what the final user API for this should be, with comptime you can hack out something quick and dirty. (This is not in this PR, as it depends on some other in flight PRs.)
Check out the tests to see examples of comptime in action.
In short, comptime is a very powerful debugging tool that lets you drop into Dynamo from user code, without having to manually jerry-rig pdb inside Dynamo to trigger after N calls.
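For flavor, a hedged sketch of what using it can look like (the import path and the `ctx.print_graph()` method name are assumptions based on the capabilities listed above; see the tests for the authoritative API):
```python
import torch
import torch._dynamo
from torch._dynamo.comptime import comptime  # import path assumed

@torch._dynamo.optimize("eager")
def f(x):
    y = x.sin()
    # Runs while Dynamo is symbolically evaluating f, not at runtime.
    # ctx.print_graph is an assumed method name for "print the partial graph".
    comptime(lambda ctx: ctx.print_graph())
    return y.cos()

f(torch.randn(3))
```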
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90983
Approved by: https://github.com/jansel
1. If the user uses amp to run bfloat16 models, `torch.autocast` will keep module parameters in the accumulation dtype, which leaves `gamma` and `beta` in float while the input/output are in bfloat16.
2. If the user explicitly casts the model to bfloat16, the input/output and gamma/beta will all be in bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81852
Approved by: https://github.com/jgong5, https://github.com/malfet
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `linear-tanh` op for the `onednn` backend, which will be used for int8 inference with the `onednn` backend. Linear-tanh is found in models like CGAN.
This op cannot be called with other quantization backends; otherwise an error is thrown.
**Test Plan**
python test_quantization.py TestQuantizedLinear
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88879
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Apply clang-tidy check modernize-use-emplace. This is slightly more efficient by using an inplace constructor and is the recommended style in parts of the codebase covered by clang-tidy. This just manually applies the check to rest of the codebase. Pinging @ezyang as this is related to my other PRs he reviewed like #89000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91077
Approved by: https://github.com/ezyang
Currently the `torch.backends.cudnn.benchmark_limit` setting ignores the validity/status of proposed cuDNN frontend execution plans because we do not know if they will complete successfully until execution is attempted. However, there are rare cases where the majority of execution plans fail and a fallback plan is needed (e.g., in the case of extremely small pointer alignment on the input tensors). If the limit is too small to include a working fallback plan, we currently bail out prematurely without checking the plans exhaustively.
The fix is to defer applying the `benchmark_limit` setting until we are sure that plans will execute successfully, but this requires changes to the cuDNN frontend timing function. This PR adds a hacked version of the cuDNN frontend timing function for now, with the intent that we can switch to the upstream cuDNN frontend implementation once this functionality is added.
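For context, the knob in question is set like this (a plain usage sketch, not new behavior from this PR):
```python
import torch

# benchmark_limit caps how many cuDNN v8 execution plans are tried when
# cuDNN benchmarking is enabled; 0 means try all of them.
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.benchmark_limit = 10
```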
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91032
Approved by: https://github.com/ngimel
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization
* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules
Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config
Note: everything related to quantize_pt2e are experimental (prototype), and we don't have any bc guarantees
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91035
Approved by: https://github.com/HDCharles
This should fix hf_Longformer, AllenaiLongformerBase, and tacotron2 with dynamic shapes. Example repro:
```
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 python benchmarks/dynamo/torchbench.py --accuracy --backend aot_eager --training --only hf_Longformer
```
used to fail with:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 1024, 12, 513]], which is output 0
of AsStridedBackward0, is at version 6; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient,
with torch.autograd.set_detect_anomaly(True).
```
The problem is that:
(1) when we have a tensor from the forward whose sizes are needed in the backward, we were saving the actual tensor for backward and directly grabbing the sizes off of it inside of the backward graph (bad for perf)
(2) If that tensor happens to be a graph input that gets mutated, we end up with the above error. Autograd yells at you if you try to save a tensor for backward, and later mutate it.
I confirmed that this problem doesn't happen for the min cut partitioner.
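As a loose, illustrative-only analogy for (1), written as user-level autograd.Function code rather than the partitioner change itself: saving just the size needed in backward, instead of the tensor, avoids both the perf cost and the save-then-mutate error.
```python
import torch

class ScaleByNumel(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save only the size needed in backward, not the tensor itself, so a
        # later in-place mutation of x cannot invalidate the saved state.
        ctx.x_numel = x.numel()
        return x * (2.0 / x.numel())

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * (2.0 / ctx.x_numel)
```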
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91012
Approved by: https://github.com/ezyang
This PR adds the following OpInfo tests:
- vmap x vjp x vmap
- vjp x vmap x vmap
- vjp x vjp x vmap
These OpInfo tests only run for the autograd_function_db. In general,
testing composition of two transforms is sufficient to convince
ourselves that functorch works on a given operator.
The autograd.Function testing (especially the upcoming
generate_vmap_rule) didn't feel rigorous enough to me, so I added these
additional tests to convince myself.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90962
Approved by: https://github.com/samdow, https://github.com/soulitzer
This PR:
- adds VmapInterpreter.randomness. This returns the randomness option
the user provided in vmap(..., randomness=...)
- adds randomness in the info object passed to the vmap staticmethod of
autograd.Function. This is so that the user can handle random operations
on their own terms (if randomness="error", and if the autograd.Function
has random operations, then it is the user's responsibility to raise an
error).
Test Plan:
- updated unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90789
Approved by: https://github.com/samdow, https://github.com/soulitzer
Tracing `torch.backends.cudnn.is_acceptable(Tensor) -> bool:` fails with:
```
...
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/functions.py", line 196, in call_function
return super(UserFunctionVariable, self).call_function(tx, args, kwargs)
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/functions.py", line 67, in call_function
return tx.inline_user_function_return(
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 426, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 1698, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 1752, in inline_call_
tracer.run()
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 485, in run
and self.step()
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 455, in step
getattr(self, inst.opname)(inst)
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 281, in wrapper
return inner_fn(self, inst)
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 912, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 389, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/torch.py", line 431, in call_function
tensor_variable = wrap_fx_proxy(
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/builder.py", line 662, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/builder.py", line 820, in wrap_fx_proxy_cls
raise AssertionError(
AssertionError: torch.* op returned non-Tensor bool call_function <function is_acceptable at 0x7f00deefb790>
```
So instead, evaluate `is_acceptable()` and convert the result to a constant. The result of `is_acceptable(tensor) -> bool` depends on:
* dtype/device of the input tensor (this should already be guarded)
* properties of the build & whether cudnn is available
* some global state that gets initialized during the first call to `torch.backends.cudnn._init()` (this is NOT guarded in this PR)
Note: this fixes tts_angular with FSDP. This was an issue with FSDP because FSDP modules are interpreted as UnspecializedNNModules, and UnspecializedNNModules try to inline calls. In comparison, NNModules (e.g. when the tts_angular model is not wrapped in FSDP) do not inline calls and instead evaluate subsequent calls. In subsequent calls, cudnn.is_acceptable would be skipped by eval_frame.py:catch_errors because it is not in an allowlist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90323
Approved by: https://github.com/jansel
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.
#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
Adds a set of generated tests for `AOTAutograd` using the `ModuleInfo` db, analogous to the `OpInfo`-based tests. Includes the following changes:
* Adds a `TestEagerFusionModuleInfo` test class, with both symbolic and non-symbolic tests, just like the OpInfo tests.
* Test logic "functionalizes" the module under test and calls into the now-factored-out verification logic the OpInfo tests use to compare compiled vs. non-compiled function outputs / grads.
* Adds a `decorateForModules(decorator, module_set)` utility to `test/functorch/common_utils.py` to handle xfails, skips, etc. The pre-existing logic is specific to ops, and I didn't want to duplicate all that, so I kept additions minimal with this function.
* Bunch of xfails to get everything passing; haven't looked deeply into all these yet. #90500 is relevant for the RNN failures.
* Fixes a bug in the `ModuleInfo` entry for `NLLLoss` to ensure sample input has the requested `requires_grad` setting (was causing spurious test failures).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90980
Approved by: https://github.com/ezyang
Summary: Introduce causal mask
This PR introduces a causal mask option _causal_mask (as well as causal mask detection if attn_mask is provided), since current custom kernels do not support arbitrary masks.
Test Plan: sandcastle & github ci/cd
Differential Revision: D41723137
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90508
Approved by: https://github.com/albanD
- This PR introduces `_get_fsdp_root_states(state: _FSDPState, module: nn.Module)` to return all states that are FSDP root in the module tree rooted at `module`.
- This requires passing in both `state` and `module` because it must call `_lazy_init()` to check for root-ness, which requires that signature.
- This PR moves the one internal usage of `FullyShardedDataParallel.fsdp_modules(root_only=True)` to use `_get_fsdp_root_states()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90862
Approved by: https://github.com/rohan-varma
This PR removes the "communication module" (comm. module / `comm_module`) concept from the FSDP code base since it causes disproportionate confusion compared to its benefit for now.
Instead, we introduce the term "fully sharded module" as the single concept to unify the wrapper and non-wrapper code paths. The definition is presented in a note at the top of `flat_param.py`. I reproduce it here:
---
We define the **"fully sharded module"** to be the original `nn.Module` that owns a `FlatParamHandle`. It is the *single* module logically responsible for the *single* unshard/reshard pair for the handle's `FlatParameter` for a given forward or backward pass. The fully sharded module should be passed to the `FlatParamHandle` constructor.
For the wrapper code path:
- The `FullyShardedDataParallel` module wrapping the fully sharded module runs the unshard/reshard on behalf of the fully sharded module by overriding `nn.Module.forward`.
- The fully sharded module is exactly the module passed to the `FullyShardedDataParallel` constructor's `module` argument and is saved in `_fsdp_wrapped_module`.
For the non-wrapper code path:
- Hooks registered on the fully sharded module run the unshard/reshard.
- The fully sharded module may either be the direct argument to `fully_shard` or a submodule chosen by the provided wrapping policy.
---
After this PR, `handle.flat_param._fqns`, `_param_infos`, and `_shared_param_infos` all prefix names from the same module, namely the fully sharded module. This should make state dict less confusing.
---
As an example, consider:
```
mod: Module(
sub1: Submodule(
subsub1: Subsubmodule(),
subsub2: Subsubmodule(),
),
sub2: Submodule(
subsub1: Subsubmodule(),
subsub2: Subsubmodule(),
),
)
```
For wrapper FSDP manual wrap:
```
mod.sub1 = FSDP(mod.sub1)
mod.sub2 = FSDP(mod.sub2)
mod = FSDP(mod)
```
For wrapper FSDP auto wrap:
```
mod = FSDP(mod, auto_wrap_policy=ModuleWrapPolicy({Submodule}))
```
(WIP) For non-wrapper FSDP manual wrap:
```
fully_shard(mod.sub1)
fully_shard(mod.sub2)
fully_shard(mod)
```
For non-wrapper FSDP auto wrap:
```
fully_shard(mod, policy=ModuleWrapPolicy({Submodule}))
```
The fully sharded modules **in all cases** are `mod`, `mod.sub1`, and `mod.sub2`; notably, the `subsub1`s and `subsub2`s are not fully sharded modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90933
Approved by: https://github.com/rohan-varma
Fixes https://github.com/pytorch/torchdynamo/issues/1717, https://github.com/pytorch/torchdynamo/issues/1990
<s>TODO: add test with multiple devices, figure out extra context initialization</s>
Problems:
<s>It still initializes context on 0-th device that it shouldn't, I'll take a look where that happens and fix before landing</s>
It adds a Python device context manager that is absurdly slow and takes ~2.5 us (should be nanoseconds). That's not a problem for real models, because it'll be called just once, but it is a bit of an inconvenience for microbenchmarking; we should make that context manager more performant (won't fix in this PR).
It can still have bugs for graphs that run on multiple devices and can have buffers incorrectly shared between multiple devices by memory reuse; if that happens, that'll need to be solved separately.
Generated code:
```
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    with torch.cuda.device(1):
        buf0 = empty_strided((4, ), (1, ), device='cuda', dtype=torch.float32)
        stream1 = get_cuda_stream(1)
        triton_fused_div_0.run(arg0_1, arg1_1, buf0, 4, grid=grid(4), stream=stream1)
        del arg0_1
        del arg1_1
        return (buf0, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90934
Approved by: https://github.com/wconstab
This change introduces a mechanism to test onnx export based on sample inputs registered in OpInfo, similar to how MPS and other components of pytorch are tested. It provides test coverage on ops and dtypes previously unattainable with manually created test models. This is the best way for us to discover gaps in the exporter support, especially for ops with partial existing support.
This test is adapted from https://github.com/pytorch/pytorch/blob/master/test/test_mps.py
This PR also
- Update sqrt to support integer inputs to match pytorch behavior
- Add pytest-subtests for unittest subtests support in the new test file
I only enabled a very few ops: `t`, `ceil`, and `sqrt`, because otherwise too many things would fail due to (1) unsupported dtypes in the exporter, (2) unimplemented dtype support in onnxruntime, and (3) unexpected input to verification.verify.
Subsequent PRs should improve `verification.verify` first for it to accept any legal input to a pytorch model, then incrementally fix the symbolic functions to enable more test cases.
Fixes#85363
Design #88118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86182
Approved by: https://github.com/BowenBao
This PR migrates all internal usages of `FullyShardedDataParallel.fsdp_modules(root_only=False)` to `_get_fsdp_states()`. This is to unify the code paths for composable and wrapper FSDP.
This PR _does not_ change the usages in test files. This is because we should revisit those usages separately as a way to track which functionality for which we have not tested composable FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90861
Approved by: https://github.com/rohan-varma
I started refactoring unit tests to use `_get_fsdp_states()` instead of `FullyShardedDataParallel.fsdp_modules()` but realized we should not do that for now. This is just a change I made while doing that. `entry` is not descriptive. Let us explicitly say `fsdp_module`. `for fsdp_module in FSDP.fsdp_modules(module)` is a proper idiom.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90864
Approved by: https://github.com/rohan-varma
- This PR introduces `_get_fsdp_states(module: nn.Module) -> List[_FSDPState]` to prepare for `fully_shard` manual "wrapping".
- ~~I place it in `_runtime_utils.py`, not `_common_utils.py`, because in a follow-up PR, I will add `_get_root_fsdp_states()`, which requires `_lazy_init()`. I concluded that it would be preferred to have both of these getters be in the same place than to have them split, even if that means that `_get_fsdp_states()` is in `_runtime_utils.py`.~~ Due to circular import issues, I think I should still put it in `_common_utils.py`.
- This PR changes `FullyShardedDataParallel.fsdp_modules()` to be backed by `_get_fsdp_states()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90860
Approved by: https://github.com/rohan-varma
People's general tendency is to read from top to bottom. Leverage that at the right moment to help them realize that there's a troubleshooting section they can use if they get stuck.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90927
Approved by: https://github.com/ZainRizvi
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `QLinearLeakyReLU` module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.
**Test plan**
python test_quantization.py TestStaticQuantizedModule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88661
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization
* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules
Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config
Note: everything related to quantize_pt2e are experimental (prototype), and we don't have any bc guarantees
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90971
Approved by: https://github.com/HDCharles
GraphArgs worked fairly well, but it was still missing sources
sometimes. Now, we maintain an auxiliary data structure which we
MUST populate whenever we fakeify a tensor / allocate a bare SymInt.
This should guarantee once and for all that every symbol is available.
Should fix swin_base_patch4_window7_224.
While I was at it, I moved the fakeification utility back to builder, as it was only used at one call site.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90911
Approved by: https://github.com/voznesenskym
Summary: There was an OOM issue in two internal models when turning on padding of bmm dims m and n in the shape padding optimization, so we added a flag to turn it on/off for the internal models. The issue is gone now, so we are removing the flag.
Differential Revision: D42074557
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90937
Approved by: https://github.com/ngimel
Summary:
1. use pytree to allow any input format for make_graphed_callables
2. add allow_unused_input argument for make_graphed_callables
Test Plan: buck2 test mode/dev-nosan //caffe2/test:cuda -- --print-passing-details
Differential Revision: D42077976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941
Approved by: https://github.com/ngimel
This PR fixes the segfault reported at https://github.com/pytorch/pytorch/issues/89677; it is a `double free` issue caused by an `invalid read`.
The reported issue broke in the slow path for `EmbeddingBag` on float32, at [EmbeddingBag.cpp#L451](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L451).
The root cause is that, for the reported case, `add_indices` contains an index that exceeds the range of `output_data`.
The offsets are given as
```
{0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53}
```
The `indices` has 55 elements and `offsets[-1] != indices.size(0)`.
When `include_last_offset` is true, the `output` will be in the shape of {offsets.size(0) - 1, weight.sizes()[1]}, which will be {10, 5}.
Originally, `add_indices` will be (I re-arrange the 1D tensor by rows, so here 10 rows in total):
```
### this is 55 elements
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5 5
6 6
7 7 7 7
8 8 8 8 8 8 8
10 10
```
The last row has an index of 10, which is out of range for the output tensor whose size is [10, 5].
The reason is that `make_offset2bag` at [EmbeddingBag.cpp#L66](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L66) would give the following `offset2bag`:
```
### this is 55 + 1 elements:
0 0 0 0 0 0 1
0 0 0 0 0 1
0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 1
0 0 0 1
0 0 0 0 0 0 2
0 0
```
Notice for index 53, it is added twice.
The fix is to ignore the last index from `offsets` when `include_last_offset` is true; this behavior also aligns with CUDA, as quoted in https://github.com/pytorch/pytorch/pull/57208#issuecomment-1021727378
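For reference, a hypothetical repro sketch along the lines of the description above (shapes and values are illustrative, not the exact reported input):
```python
import torch

# offsets[-1] (53) does not equal indices.size(0) (55), combined with
# include_last_offset=True, which is the problematic configuration.
weight = torch.randn(60, 5)
indices = torch.randint(0, 60, (55,))
offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])

# Output has offsets.size(0) - 1 = 10 bags; before the fix, the trailing indices
# could be mapped to bag 10, which is out of range for a [10, 5] output.
out = torch.nn.functional.embedding_bag(
    indices, weight, offsets, mode="sum", include_last_offset=True
)
print(out.shape)  # torch.Size([10, 5]) with the fix in place
```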
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90358
Approved by: https://github.com/ezyang
Reset the RNG in the HuggingFace benchmark before generating inputs and loading the model; this makes the HuggingFace inputs and weights deterministic given the RNG seed. This matches the behavior of the other test suites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90936
Approved by: https://github.com/desertfire
For parameter mixed precision, we cast the inputs to the low precision parameter dtype. If the input has tensors that require gradient, then we must cast them in place in order for them to receive a gradient. The cast should be tracked by autograd (e.g. with `grad_fn` equal to `ToCopyBackward0`). This removes the `torch.no_grad` context when calling `_apply_to_tensors`.
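As a small illustration of the distinction (plain autograd behavior, not the FSDP code itself):
```python
import torch

x = torch.randn(4, requires_grad=True)

# Cast tracked by autograd: gradients can flow back to x through the copy.
y = x.to(torch.float16)
print(y.grad_fn)  # <ToCopyBackward0 ...>

# Cast under no_grad: the copy is not tracked, so x would receive no gradient.
with torch.no_grad():
    z = x.to(torch.float16)
print(z.grad_fn)  # None
```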
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90921
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization
* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules
Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config
Note: everything related to quantize_pt2e is experimental (prototype), and we don't have any BC guarantees
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90802
Approved by: https://github.com/qihqi
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in terms. Say for
example you have:
```python
def fn(a):
    for _ in range(3):
        a = ops.mul(a, a)
    return a
```
You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```
This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward` which in #pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.
This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```
This corresponds to the sequence of `ops` calls made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
In the context of hybrid sharding strategies, we only need to enforce the same process groups among the instances using a hybrid sharding strategy, not all instances. We can even mix and match the two different hybrid sharding strategies. This PR relaxes the validation to support this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90846
Approved by: https://github.com/rohan-varma
**Summary**
The onednn quantization backend switches to the new API in `third_party/ideep`.
- `struct forward_params` for conv/deconv is changed. The primitive cache is modified accordingly.
- Use the new versions of the `prepare` and `compute` APIs. The fp32 and int8 paths are separated. The old versions will be deprecated.
- `ideep::tensor::reorder_if_differ_in` now supports block-to-block reorder. Use it instead of defining a util function `onednn_utils::try_reorder`.
- For the new API of transposed convolution, we can use a flag to keep the weight desc aligned with oneDNN, so there is no need to transpose it explicitly in PyTorch.
- Use the `is_channels_last` flag to specify the layout of src/dst when querying the expected weight desc.
It won't impact correctness. Performance should be unaffected or slightly better.
FBGEMM and QNNPACK backends are not affected.
Performance results are given below.
1. End-to-end performance of static quantized models (from torchvision)
(throughput: fps, higher is better)

2. Op benchmark of dynamic quantized linear
(Latency: ms, lower is better)

Test method & env:
- Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- Run multi-instances on a single node. Use one core for each instance.
- Use Jemalloc and Intel OpenMP
**Test plan**
python test/test_quantization.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90354
Approved by: https://github.com/jgong5
In the prior patch, I just YOLOed a mutable mapping implementation.
Many edge cases were not handled correctly. In this PR, I just
copy-pasted the WeakKeyDictionary from CPython and then hacked it up
to use WeakIdRef instead of weakref.ref. You can see each line
I changed with the comment CHANGED; there aren't many.
Being exactly API compatible with WeakKeyDictionary means I can also
rob all of the tests from CPython, which I also did for
test/test_weak.py
How to review? You could either try taking the delta from CPython
(recommended), or review everything from scratch (not recommended).
Can post diff representing delta on request.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90825
Approved by: https://github.com/albanD
Fixes #74177
Since the RNN code uses static variables to cache state, we store an atomic_flag in the RNG generator to signal new seed changes and generate a new random state for the RNN. The additional cost is that it must check the atomic_flag each time to ensure reproducibility. This may be ugly, but it is currently the best way without a large code refactoring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90522
Approved by: https://github.com/ngimel
I made an important mistake here in thinking that `not result.skipped` meant that the current test wasn't skipped.
Like `result.failures` and `result.errors`, `result.skipped` is a list of all the skip messages accumulated so far in the test suite (https://docs.python.org/3/library/unittest.html#unittest.TestResult). As such, the correct way to check whether the current test was skipped is to compare `skipped_before` with `len(result.skipped)` after running the test, the same way failures and errors are handled. If they are equal, the test wasn't skipped.
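A minimal sketch of the corrected check (illustrative; the actual logic lives in the test runner):
```python
import unittest

class _Example(unittest.TestCase):
    @unittest.skip("demo")
    def test_skipped(self):
        pass

result = unittest.TestResult()
skipped_before = len(result.skipped)
_Example("test_skipped").run(result)
# Compare lengths instead of relying on the truthiness of result.skipped.
was_skipped = len(result.skipped) > skipped_before
print(was_skipped)  # True
```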
### Testing
`python test/run_test.py -i test_autograd --verbose` to confirm that the disabled test `test_profiler_seq_nr` is run 50 times always in rerun mode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90888
Approved by: https://github.com/clee2000
## Pitch
Change input args type from `std::tuple` to `std::vector` to reduce the compilation time.
## Description
`std::tie()` takes quite a long time to compile as the number of input args grows.
For example, for a graph from the `PegasusForConditionalGeneration` model with 318 input args, the compilation of `std::tie` for the args is about 10s. By changing to std::vector, the compilation time of arg assignment is reduced to less than 1s.
### Code before:
```cpp
at::Tensor call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    ...
    return buf0;
}
```
### Code after:
```cpp
at::Tensor call_0(std::vector<at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    arg0_1 = args[0];
    arg1_1 = args[1];
    ...
    return buf0;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90754
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This is useful for debugging what autocast is doing when it's running on top of torchdynamo; without this, the Python dispatch key for autocast prints as `???`.
Test Plan:
```
import torch
dir(torch._C.DispatchKey)
# the autocast keys show up now
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90821
Approved by: https://github.com/ezyang
Summary: The existing BackendConfig fusion pattern
uses a "reversed nested tuple" format that is highly
unintuitive. For example,
```
linear-relu -> (nn.ReLU, nn.Linear)
conv-bn-relu -> (nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))
```
This pattern format also complicates the signatures
of the user specified "fuser methods", which needed
to accept arguments in reverse nested order to match
the patterns:
```
def fuse_linear_relu(is_qat, relu, linear):
    ...
def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    ...
```
Instead, this commit introduces a new pattern format that
simply specifies the ops in forward order with no nesting:
```
linear-relu -> (nn.Linear, nn.ReLU)
conv-bn-relu -> (nn.Conv2d, nn.BatchNorm2d, nn.ReLU)
def fuse_linear_relu(is_qat, linear, relu):
    ...
def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    ...
```
Note that the legacy "reversed nested tuple" is still
used internally since it is more general. In the
future, we should replace it with the format used in
the subgraph rewriter in `torch.fx`, and simplify the
existing pattern matching code to handle the new
format added in this commit.
BC-breaking Notes:
Before:
```
import torch.nn as nn
import torch.ao.nn.intrinsic as nni
from torch.ao.quantization.backend_config import BackendPatternConfig
def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)
config = BackendPatternConfig((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```
After:
```
def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    return nni.ConvBnReLU2d(conv, bn, relu)
config = BackendPatternConfig((nn.Conv2d, nn.BatchNorm2d, nn.ReLU)) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```
OR (for backward-compatibility)
```
def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)
config = BackendPatternConfig() \
    ._set_pattern_complex_format((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d) \
    ._set_use_legacy_pattern_format(True)
```
Before:
```
backend_config.configs # returns Dict[Pattern, BackendPatternConfig]
```
After:
```
backend_config.configs # returns List[BackendPatternConfig]
```
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestBackendConfig
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Differential Revision: [D41954553](https://our.internmc.facebook.com/intern/diff/D41954553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90698
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
Changes:
- Allow multiple `sharding_filter` in the pipeline as long as they are not on the same branch
- [x] Add test
Example:
```mermaid
graph TD;
DP1-->sharding_filter_1;
sharding_filter_1-->DP3;
DP2-->sharding_filter_2;
sharding_filter_2-->DP4;
DP3-->DP4;
DP4-->output;
```
In order to properly shard `DP1` and `DP2`, we should allow multiple `sharding_filter`s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90769
Approved by: https://github.com/NivekT
This PR
- Removes `_module_to_handles` since it is no longer used. We instead use `_comm_module_to_handles`.
- Removes `HandleConfig` and stores its fields directly as attributes on `FlatParamHandle`.
- Uses the term `fqn`/`fqns` uniformly in `flat_param.py` instead of `prefixed_param_name` / `prefixed_param_names`.
- Clarifies some documentation.
I am including all of these BE items in the same PR to save CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90840
Approved by: https://github.com/rohan-varma
Retry of #90591, which is a retry of #89595. Reverted due to dependency PR breaking internal fbcode.
## Forked BaseCppType
Created a module for Executorch: `torchgen.executorch`.
## In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.
## In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
## In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)
## In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90781
Approved by: https://github.com/ezyang
Retry of #90590, which is a retry of #89594. Original PR reverted due to internal breakage.
This PR fixes the breakage by adding a default value to the new argument.
This PR allows `get_native_function_declarations` API to take a function as argument. This function should take `NativeFunction` as input and emit code for native function declaration. By default it is `dest.compute_native_function_declaration`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90780
Approved by: https://github.com/ezyang
Give a unique prefix to all steps in lint.yml which catch valid linter errors. This will let retrybot identify lint.yml steps which should not be retried.
This is a prelude to https://github.com/pytorch/test-infra/pull/1275 which extends the retry-on-failure behavior to all PRs in addition to trunk.
This hadn't been an issue previously since we would only ever see linter failures on `master`, where they were always safe to retry because legitimate linter failures there are virtually non-existent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90705
Approved by: https://github.com/huydhn, https://github.com/malfet
It turns out it is possible to break cycles by not directly importing a
module:
- there's a problem that torch.jit imports torch._ops and torch._ops
import torch.jit
- there's another problem that torch.autograd.function imports
custom_function_call but torch._functorch.autograd_function imports
torch.autograd.function
The "better" way to handle all of this is to do some large refactoring so
that torch._functorch.autograd_function imports some file that has
_SingleLevelAutogradFunction and then have torch.autograd.function
depend on torch._functorch.autograd_function... (and ditto for torch.jit
vs torch._ops), but I'm scared to move code around too much for BC
reasons and the fix in this PR works well.
Test Plan:
- import torch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90415
Approved by: https://github.com/albanD, https://github.com/soulitzer
This PR adds functorch.jvp support for autograd.Function. It does so by
adding a jvp rule for custom_function_call.
For a regular PyTorch operation (like at::sin), the VariableType kernel:
- re-dispatches to at::sin
- calls the jvp rule for at::sin
The jvp rule for custom_function_call does just that. It constructs a
new autograd.Function (because the above logic already exists). Inside
the forward, it re-dispatches to custom_function_call. In the jvp rule,
it just calls whatever the jvp rule is supposed to be.
Since this logic is really close to the custom_function_call_grad, I
just put them together.
Test Plan:
- added jvp rules to the autograd.Function in autograd_function_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90077
Approved by: https://github.com/albanD, https://github.com/soulitzer
For reductions, the code string in the codegen stage and the execution stage are different due to `\`.
- The code string gotten from `code.getvalue()` (`code` is an `IndentedBuffer`) in codegen stage:
```
#pragma omp declare reduction(argmax : struct IndexValue_1 :\
omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
- The code string loaded during the execution (`\` will be escaped):
```
#pragma omp declare reduction(argmax : struct IndexValue_1 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
Thus we can't get the same hash value for these two pieces of code.
This PR adds a function to make the transformation escape the backslash in the codegen stage.
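A minimal sketch of the kind of normalization involved (an assumed helper, not the actual Inductor code): collapse the backslash-newline continuations at codegen time so the hashed string matches what is read back at execution time.
```python
def collapse_line_continuations(src: str) -> str:
    # A backslash followed by a newline is a line continuation in the pragma;
    # collapse it so both stages hash the same single-line string.
    return src.replace("\\\n", " ")
```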
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
Previously, we planned to lift the parameters and weights while exporting and implement our own transformer to "unlift" the lifted weights and params back to the graph as attributes. But this is a bit challenging because:
- We need to maintain correct ordering for weights and parameters that are passed as inputs so that we know how to map them back.
- Some weights are unused in the graph, so our transformer needs to be aware of which weights and parameters are not used in the graph. And we need to distinguish which are real user input and which are parameters.
- There can be more edge cases we haven't seen in other models yet.
I am aware that @Chillee and @bdhirsh mentioned that functionalization won't work with fake-tensor attributes but this is fine for the short term as we don't expect users to be modifying weights and params in inference mode. In fact, we explicitly disable attribute mutation in torchdynamo export mode right now.
Given the above conditions, it might be OK to just fakify params when we need to. I use a flag to guard this change.
Differential Revision: [D41891201](https://our.internmc.facebook.com/intern/diff/D41891201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90417
Approved by: https://github.com/eellison
1) don't codegen maxpool backward, it's exceedingly slow
2) better determine reduction variables for more accurate hints
3) deterministic iteration order for reduction arguments; take into account all full-size reduction arguments, and for hints break ties toward the outer reduction
Fixes #1653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89616
Approved by: https://github.com/jansel, https://github.com/Chillee
**Why this PR?**
For the composable APIs implementation, sometimes the internal APIs may not have the application (FSDP, DDP) root module but only the local module. One example is the state_dict/optimizer_state_dict implementation of FSDP. These APIs are designed to start with the root module of the model. It is tricky for these APIs to tell whether a random submodule is managed by either DDP or FSDP.
It will be useful to have APIs like:
`_get_module_state(module)`: return the composable state if this module is managed by composable API.
`_get_module_fsdp_state(module)`: return the FSDP state if this module is managed by FSDP.
**What does this PR propose?**
1. Make `_State` out of `_composable` module so that `FullyShardedDataParallel` can inherit from it.
2. A global `_module_state_mapping: Dict[nn.Module, _State]` that keeps the mapping of all submodules (not just root module) to the state.
3. Create `_get_module_state(module)` to look up `_module_state_mapping`.
4. Create `_get_module_fsdp_state(module)` that uses `_get_module_state(module)` to get the state then verifies if the state is `_FSDPState`.
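A simplified sketch of the lookup scheme proposed above (the real helpers are private to `torch.distributed`, and the `_State`/`_FSDPState` classes here are placeholders):
```python
from typing import Dict, Optional
import torch.nn as nn

class _State: ...              # placeholder for the composable API state
class _FSDPState(_State): ...  # placeholder for the FSDP-specific state

# Global mapping from every managed submodule (not just the root) to its state.
_module_state_mapping: Dict[nn.Module, _State] = {}

def _get_module_state(module: nn.Module) -> Optional[_State]:
    # Works for any submodule, not just the application root module.
    return _module_state_mapping.get(module)

def _get_module_fsdp_state(module: nn.Module) -> Optional[_FSDPState]:
    state = _get_module_state(module)
    return state if isinstance(state, _FSDPState) else None
```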
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89147
Approved by: https://github.com/awgu
`no_sync()` introduces a separate case where a `FlatParameter` maintains an _unsharded_ gradient, instead of a _sharded_ one. This PR fixes `no_sync()` with `use_orig_params=True` by dealing with this separate case.
The existing `use_orig_params=False` already bypasses the built-in parameter/gradient size check, where the `flat_param` is sharded, while the `flat_param.grad` is unsharded. For `use_orig_params=True`, we need to use the same `.data` hack to side step the size check that we used to side step the dtype check for `keep_low_precision_grads=True`.
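To illustrate the built-in size check and the `.data` side step with plain tensors (not the actual FSDP internals):
```python
import torch

param = torch.nn.Parameter(torch.zeros(4))
param.grad = torch.zeros(4)      # a gradient matching the (sharded) parameter size
unsharded_grad = torch.ones(8)   # a differently sized (unsharded) gradient

# Direct assignment `param.grad = unsharded_grad` would fail the built-in size
# check, but swapping the underlying data bypasses it.
param.grad.data = unsharded_grad
print(param.grad.shape)  # torch.Size([8])
```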
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90546
Approved by: https://github.com/rohan-varma
**What:**
This PR adds optim state_dict support for `use_orig_params` with rank0_only being False. rank0_only support will be added in a following PR. The design of this PR focuses on simplicity and may not have good performance, especially for optim state_dict loading. Since optim state_dict loading is only called once at the beginning of training, performance is not the major concern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89900
Approved by: https://github.com/awgu, https://github.com/rohan-varma
TF32 is not supported on ROCm and hence the torch/profiler/_pattern_matcher.py FP32MatMulPattern should return False for ROCm instead of checking the results of torch.cuda.get_arch_list(). Depending on the gfx arch running the test, test_profiler.py's test_profiler_fp32_matmul_pattern (__main__.TestExperimentalUtils) will fail otherwise.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84077
Approved by: https://github.com/jeffdaily, https://github.com/kit1980
Co-authored with @rohan-varma.
**Overview**
This adds preliminary `state_dict()` support for `fully_shard`.
- The only explicit branching between composable and wrapper code paths happens in the state dict hook registration, which is inevitable.
- We introduce a `_comm_module_prefix` to match the FQNs between the two code paths. This is needed since for composable, the FQNs are prefixed from the local FSDP root, whereas for state dict purposes, we want them to be prefixed from the comm. module. Thus, we need this `_comm_module_prefix` to be stripped during state dict.
- In my understanding, the alternative of not using the `prefix` argument in `state_dict()` does not support the case where `fully_shard` is applied to a submodule (i.e. not the global root module), since we still need _part_ of `prefix` then.
**Follow-Ups**
- We can retire the `functools.partial` usage once @fegin's PR lands.
- We should add more thorough testing (e.g. sharded state dict, save and load together etc.).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90767
Approved by: https://github.com/rohan-varma, https://github.com/fegin
cuSPARSE v12.0 has started to use const pointers for the descriptors, from `cusparse.h` (documentation is incorrect):
```cpp
typedef struct cusparseSpVecDescr const* cusparseConstSpVecDescr_t;
typedef struct cusparseDnVecDescr const* cusparseConstDnVecDescr_t;
typedef struct cusparseSpMatDescr const* cusparseConstSpMatDescr_t;
typedef struct cusparseDnMatDescr const* cusparseConstDnMatDescr_t;
```
This also changes the function signature for the corresponding destructors to accept a const pointer. This PR adds `ConstCuSparseDescriptorDeleter`, which works with `cusparseStatus_t (*destructor)(const T*)`.
Some algorithm enums were deprecated during CUDA 11 and removed in CUDA 12; I replaced the following occurrences:
```
CUSPARSE_CSRMM_ALG1 -> CUSPARSE_SPMM_CSR_ALG1
CUSPARSE_COOMM_ALG1 -> CUSPARSE_SPMM_COO_ALG1
CUSPARSE_COOMM_ALG2 -> CUSPARSE_SPMM_COO_ALG2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90765
Approved by: https://github.com/cpuhrsch
Summary:
Inductor can't fuse pointwise into the output of concat, but it can
fuse into the inputs, and that's the same thing. So we hoist pointwise through
a concat (followed by an optional series of views).
Test Plan: New unit test
Differential Revision: D41901656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90743
Approved by: https://github.com/jiawenliu64, https://github.com/jansel
A "life handle" is a pointer-to-boolean that says whether or not a
TensorWrapper is alive. A TensorWrapper is alive if we are currently
inside of its corresponding transform. An Interpreter is alive if we are
currently inside of its corresponding transform. I.e., for vmap(f)(x),
the BatchedTensor(x, level=1) is alive inside of the execution of f; and
the corresponding VmapInterpreter is alive inside of f.
Previously, there was a global map of level to life handle. It is
possible to get into a state where we have multiple levels that refer to
different Interpreters (if the implementation of an operator calls into
functorch) and that messes up the global map.
This PR changes it so that
- every Interpreter holds a life handle that says if it is alive
- to construct a TensorWrapper, one must either (a) directly pass it a life
handle, or (b) one must create the TensorWrapper when the corresponding
Interpreter is on the stack (and we will automatically grab the life
handle by indexing into the DynamicLayerStack with the level)
(a) is more robust so I changed most of our C++ callsites to do that.
(b) feels a bit hacky to me, but it seems fine for now:
- It'll raise a nice error message if the interpreter isn't on the stack
- all of our Python callsites already follow this convention (we construct
TensorWrappers after pushing the Interpreter onto the stack).
The alternative to (b) is that we always do (a), which we can do in the
future if (b) runs us into any problems.
Test Plan:
- all functorch tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90317
Approved by: https://github.com/samdow
Motivation
- These were previously defined in functorch. They are not
functorch-specific, so I'm moving them to torch.autograd.forward_ad and
the autograd python bindings.
- I need this to avoid some of my cyclic import problems.
Should these be public APIs? Probably. Though this needs discussion, so
punting it to the future.
Test Plan:
- moved the tests of these from test/functorch/test_eager_transforms.py
to test/test_autograd.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90240
Approved by: https://github.com/soulitzer
This PR adds a `vmap` staticmethod to autograd.Function and a
corresponding vmap kernel for custom_function_call. These two items mean
that autograd.Function with a vmap staticmethod can be used with vmap.
```py
class NumpyMul(torch.autograd.Function):
    @staticmethod
    def forward(x, y):
        return torch.tensor(to_numpy(x) * to_numpy(y), device=x.device)

    @staticmethod
    def setup_context(ctx, outputs, x, y):
        ctx.save_for_backward(x, y)

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        gx = None
        if isinstance(x, torch.Tensor) and x.requires_grad:
            gx = NumpyMul.apply(grad_output, y)
        gy = None
        if isinstance(y, torch.Tensor) and y.requires_grad:
            gy = NumpyMul.apply(grad_output, x)
        return gx, gy

    @staticmethod
    def vmap(info, in_dims, x, y):
        x_bdim, y_bdim = in_dims
        x = x.movedim(x_bdim, -1) if x_bdim else x.unsqueeze(-1)
        y = y.movedim(y_bdim, -1) if y_bdim else y.unsqueeze(-1)
        result = NumpyMul.apply(x, y)
        result = result.movedim(-1, 0)
        return result, 0
```
API Spec
- the staticmethod takes two arguments (info, in_dims) as well as the
unexpanded inputs (x, y).
- If we think about it as `vmap(info, in_dims, *args)`, `in_dims` is a
pytree with the same tree structure as args. It has None if the arg is
not being vmapped over and an integer vmapped dimension index if it is.
- `info` is an object with metadata about the vmap. It currently has one
field, `info.batch_size`. In the future we can extend this by adding
things like the randomness information.
- If there is a single vmap going on, (x, y) are NOT BatchedTensors,
they've already been unpacked.
- We expect the user to return a `(outputs, out_dims)` tuple. `out_dims`
must "broadcast" to the same pytree structure as `outputs`.
Semantics
- vmap(NumpyMul.apply)(x) will apply the vmap staticmethod if there is
one and will never actually run NumpyMul.forward.
- In order for the autograd.Function to support nested vmap (e.g.,
`vmap(vmap(NumpyMul.apply))(x)`, then the vmap staticmethod must call
into operations that vmap understands (i.e. PyTorch operators or more
autograd.Function).
At a high level, this PR:
- adds a vmap rule for custom_function_call
Testing
- Added some tests for in_dims and info
- Added vmap staticmethod to most of the autograd.Function in
autograd_function_db and sent them through functorch's vmap-related
OpInfo tests
Future
- Better error messages if the user gets the return contract wrong. I
didn't include them in this PR because it might involve a refactor of
some of the existing code in functorch/_src/vmap.py that will add
~200LOC to the PR, but LMK if you'd prefer it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90037
Approved by: https://github.com/samdow, https://github.com/soulitzer
This PR reworks the internal handling of parameter and gradient reduction mixed precision, cleans up the post-backward hook logic, and adds some minor changes to the communication hooks.
**Overview**
This PR addresses everything in https://github.com/pytorch/pytorch/issues/90657 except renaming `keep_low_precision_grads` to `keep_grads_in_reduce_dtype` since that is BC-breaking. I recommend reading the issue before proceeding.
For `MixedPrecision(param_dtype, reduce_dtype, ...)`, the exact rule for parameter and gradient reduction mixed precision that we are following is:
> If `param_dtype is not None` and `reduce_dtype is None`, then we infer `reduce_dtype = param_dtype`. Otherwise, we take `param_dtype` and `reduce_dtype` as is.
This PR enforces that, at the `FlatParamHandle` level, `handle._config.fwd_bwd_param_dtype` and `handle._config.reduce_dtype` are never `None`. The way to check if mixed precision is enabled is to compare against the original parameter dtype, which is now stored in `handle._orig_param_dtype`. It is no longer correct to check against `None`.
This avoids ambiguous cases such as when the user passes `MixedPrecision(param_dtype=torch.float32)`. In that case, our existing implementation mistakenly thinks that parameter mixed precision is enabled and either relies on no-ops silently or errors (such as one case reported by MosaicML).
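For clarity, a minimal restatement of the dtype resolution rule above (illustrative only, not the actual `FlatParamHandle` code):
```python
import torch

def resolve_dtypes(orig_param_dtype, param_dtype=None, reduce_dtype=None):
    if param_dtype is None:
        param_dtype = orig_param_dtype   # no parameter mixed precision requested
    if reduce_dtype is None:
        reduce_dtype = param_dtype       # infer the reduce dtype from the param dtype
    return param_dtype, reduce_dtype

# MixedPrecision(param_dtype=torch.float32) on an fp32 model: neither parameter nor
# gradient reduction mixed precision is enabled, since both resolve to the original dtype.
print(resolve_dtypes(torch.float32, param_dtype=torch.float32))
# (torch.float32, torch.float32)
```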
**Additional Details**
- We remove `FullyShardedDataParallel._mixed_precision_enabled_for_params`, `FullyShardedDataParallel._mixed_precision_enabled_for_reduce`, and `FullyShardedDataParallel._mixed_precision_keep_low_precision_grads` since they are not used.
- The unit test `test_meta_device_with_mixed_precision()` exercises a tricky edge case with meta device initialization, `apply()` (calling into `summon_full_params()`), and `param_dtype=torch.float32` for a nested wrapping case, where each nested instance has parameters.
- We include some minor fixes/improvements to the communication hook implementation.
**Follow-Ups**
- We should get rid of `HandleConfig` and store its fields as attributes on `FlatParamHandle` directly.
- Rename `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90660
Approved by: https://github.com/zhaojuanmao
## Summary
torch.compile was previously not working for TransformerEncoder because the SDPA (scaled dot product attention) path calls a native function on tensors that returns an int. This PR instead creates a dispatch stub for the called function so that no separate FX node is created for this native function.
This PR also adds meta functions for the fused kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90576
Approved by: https://github.com/cpuhrsch
Summary: Only the pattern part, will leave the delegation example to Chen
Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"
Reviewed By: cccclai
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90640
Approved by: https://github.com/cccclai
#85303 added a patch to `torch.testing.assert_close` to handle `torch.storage.TypedStorage`s. This change is not reflected in the docs and is not intended for the public API. This PR removes the patch once again and moves the behavior to `TestCase.assertEqual` instead. Meaning, `TypedStorage`s are again not supported by the public API, but the behavior is the same for all internal use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89557
Approved by: https://github.com/kurtamohler, https://github.com/mruberry
Summary: A cast to int was added in
https://github.com/pytorch/pytorch/pull/45630 to make mypy not complain.
However this leads to unexpected behavior where the histogram doesn't
actually capture the full range of activation values.
note1: the test_histogram_observer_against_reference test was secretly broken on master. The random parameters that normally get run apparently don't cause a test failure, but if you run the test repeatedly in a loop, it would eventually fail. This was because, in some cases, sum(<tensor>) != torch.sum(<tensor>).item(). I was not able to reproduce this with a toy example, but running this test in a loop and editing either observer to print the calculation for 'total' would break the test and show different behaviors. Fixing this test was necessary to land this PR since the changed histogram bounds changed things enough that this test would error.
note2: updating histogram observer breaks some BC tests unless I regenerate the
model using the HistogramObserver from this PR
Test Plan: python test/test_quantization.py TestHistogramObserver.test_histogram_observer_correct_numel
python test/test_quantization -k histogram
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90355
Approved by: https://github.com/vkuzo
@mlazos: skips `item()` calls if compiling with dynamo, by defining a helper function `_get_value` which either returns the result of `.item()` or the scalar cpu tensor if compiling with dynamo. This was done because removing `item()` calls significantly regresses eager perf. Additionally, `_dispatch_sqrt` calls the appropriate sqrt function (math.sqrt, or torch.sqrt).
Fixes https://github.com/pytorch/torchdynamo/issues/1083
This PR will no longer be needed once symint support is default.
This PR closes all remaining graph breaks in the optimizers (!!)
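A rough sketch of the two helpers described above (names follow the description; the real versions live in `torch.optim` and detect compilation differently, so the `compiling` parameter here is a stand-in):
```python
import math
import torch

def _get_value(x, compiling: bool = False):
    # When compiling with dynamo, keep the 0-dim CPU tensor to avoid the graph
    # break an .item() call would cause; in eager, call .item() for speed.
    return x if compiling else x.item()

def _dispatch_sqrt(x):
    # Dispatch to torch.sqrt for tensors and math.sqrt for Python numbers.
    return torch.sqrt(x) if isinstance(x, torch.Tensor) else math.sqrt(x)

print(_get_value(torch.tensor(4.0)))      # 4.0
print(_dispatch_sqrt(torch.tensor(4.0)))  # tensor(2.)
print(_dispatch_sqrt(4.0))                # 2.0
```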
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88173
Approved by: https://github.com/albanD
Applies various automated fixes that reduce the number of spurious copies in torch, aten, and c10. I also inlined any default dtors that would have made the type trivially destructible.
Follow up to #89000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90629
Approved by: https://github.com/ezyang
The big idea is to add `create_unbacked_symfloat` and `create_unbacked_symint` to ShapeEnv, allowing you to allocate symbolic floats/ints corresponding to data you don't know about at compile time. Then, instead of immediately erroring out when you try to call local_scalar_dense on a FakeTensor, we instead create a fresh symint/symfloat and return that.
There a bunch of odds and ends that need to be handled:
* A number of `numel` calls converted to `sym_numel`
* When we finally return from item(), we need to ensure we actually produce a SymInt/SymFloat when appropriate. The previous binding code assumed that you would have to get a normal Python item. I add a pybind11 binding for Scalar (to PyObject only) and refactor the code to use that. There is some trickiness where you are NOT allowed to go through c10::SymInt if there isn't actually any SymInt involved. See comment.
* One of our unit tests tripped an implicit data dependent access which occurs when you pass a Tensor as an argument to a sizes parameter. This is also converted to support symbolic shapes
* We now support tracking bare SymInt/SymFloat returns in proxy tensor mode (this was already in symbolic-shapes branch)
* Whenever we allocate an unbacked symint, we record the stack trace it was allocated at. These get printed when you attempt data dependent access on the symint (e.g., you try to guard on it)
* Subtlety: unbacked symints are not necessarily > 1. I added a test for this.
These unbacked symints are not very useful right now as you will almost always immediately raise an error later when you try to guard on them. The next logical step is adding an assertion refinement system that lets ShapeEnv learn facts about unbacked symints so it can do a better job eliding guards that are unnecessary.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90624
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
* Skip a unittest that needs FFT if not built with FFT
* Mark a test with "slow": `python test/test_ops.py -k TestCompositeComplianceCUDA.test_forward_ad_svd_lowrank_cuda_float32` took >5min on my machine.
* Skip a flaky test that's marked "expectedFailure", similar to #90233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90609
Approved by: https://github.com/soumith
This lowers the `reduce_dtype` retrieval to the `handle` instead of the `state` in preparation for `fully_shard`, and this adds a guard to avoid a no-op `to()` call.
Note that this change pretty much gets overridden in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90615
Approved by: https://github.com/rohan-varma
Use register_state_dict_pre_hook in FSDP to simplify state_dict implementations & remove hacks. This removes `def state_dict` entirely and paves the path for composable API as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90436
Approved by: https://github.com/fegin
This saves a data structure `_stream_to_name: Dict[torch.cuda.Stream, str]` that maps each FSDP stream to its name. This can help in debugging by checking `_stream_to_name[torch.cuda.current_stream()]` to see if it is `"default"` or `"unshard"` in the post-backward hook for example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90611
Approved by: https://github.com/rohan-varma
Optimizes the nccl python bindings to reserve space when converting PyObject* into Tensors. This should reduce the number of unnecessary allocations in the nccl bindings as the std::vector grows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88203
Approved by: https://github.com/ezyang
Applies some more missing std::move found by static analysis. This should improve performance and reduce unnecessary copies. This PR only targets ATen for now.
And before you ask about the edits, std::move is optimal in a ternary operator, as copy elision cannot happen there. The best thing is probably rewriting it as an if/else, but ultimately this should be performant enough.
Followup to #88512 and #88514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89000
Approved by: https://github.com/ezyang
Instead of inferring shape mappings from a bunch of data structures that were plumbed through InstructionTranslator, we work out mappings by just iterating over the GraphArgs and mapping symbols to arguments as they show up. If multiple argument sizes/strides/offsets map to the same symbol, this means they are duck-sized, so we also generate extra equality tests asserting that they must be equal. Finally, we generate 0/1 specialization guards. The resulting code is much shorter and, I think, also easier to understand.
TODO: Delete all the tensor ref tracking code, it's unnecessary
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90528
Approved by: https://github.com/voznesenskym
So, uh, I have a new strategy for generating dupe guards, one where I don't actually need to allocate symints for every tensor that is fakeified. So I'm reverting the changes I made from earlier PRs in this one.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90381
Approved by: https://github.com/voznesenskym
Wow, I had to sweat so much to get this PR out lol.
This PR enforces the invariant that whenever we allocate SymInts as part of fakeification, the SymInt is associated with a Source, and in fact we store the string source name on SymbolWithSourceName. We use 'sname' as the shorthand for source name, as 'name' is already used by sympy to name symbols.
In order to store source names, we have to plumb source names from Dynamo to PyTorch. This made doing this PR a bit bone crushing, because there are many points in the Dynamo codebase where we are improperly converting intermediate tensors into fake tensors, where there is no source (and there cannot be, because it's a frickin' intermediate tensor). I've fixed all of the really awful cases in earlier PRs in the stack. This PR is just plumbing in source names from places where we do have it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90295
Approved by: https://github.com/voznesenskym
Summary:
This pull request makes some LazyGraphExecutor private data structures protected such that XLAGraphExecutor can reuse them.
Here is the list:
1. DeviceLocker.
2. DeviceLockerArena.
3. DataCacheArena.
In addition, it also introduces LazyGraphExecutor::ResetTrimCounter() such that XLAGraphExecutor can reuse the trim counter.
Test Plan:
CI.
P.S. This is to re-land #90457.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90598
Approved by: https://github.com/JackCaoG
Retry of #89595. Accidentally closed.
## Forked `BaseCppType`
Created a module for Executorch: `torchgen.executorch`.
In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.
In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)
In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.
Differential Revision: [D41501836](https://our.internmc.facebook.com/intern/diff/D41501836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90591
Approved by: https://github.com/iseeyuan
A retry of #89487. Accidentally closed.
## Split `torchgen.api.types` into `types_base`, `types` and `signatures`.
In `types_base`:
* Created base class `CType`. `BaseCType` and `ConstRefCType` etc are inheriting `CType`.
* Only keep abstract type model definitions, such as `BaseCppType`.
In `types`:
* Define `BaseCppType` with `at` and `c10` namespaces.
* All the signatures using these types.
In `signatures`:
* Define all the signatures.
In `__init__`:
* `from ... import *`, suppress flake8 error.
Differential Revision: [D41455634](https://our.internmc.facebook.com/intern/diff/D41455634/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41455634/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90589
Approved by: https://github.com/iseeyuan
Variable length arguments can overflow the arena being used to keep overhead
low for torch dims. If we hit this case, we know the amount of work being done
is already relatively big, so we just fall back to standard memory allocation.
Fixes #88586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88596
Approved by: https://github.com/ezyang
This adds a d3-based interactive visualization for exploring the memory
allocation traces that the caching allocator can capture. This visualization
code can also be attached to kineto trace information in the future to also
provide visualization for the memory events captured there, which come with
additional information about the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90348
Approved by: https://github.com/robieta
Summary: Modified replace_pattern in the subgraph rewriter to return a list of pairs of matches along with their corresponding replacement nodes in the modified graph (`List[Tuple[Match, List[Node]]]`). This allows us to easily modify the replaced nodes, including setting the metadata.
Test Plan: CI
Differential Revision: D41737056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90244
Approved by: https://github.com/SherlockNoMad
Summary:
Optimize the shape padding in the following respects:
- Add BFloat16 support for AMP training and Float16 support for inference
- Optimize the microbenchmark to avoid peak memory issues, and include profiling of memory ops to make a more accurate decision
- Add a flag to turn off/on padding dims N and M in `torch.bmm` due to expensive memory copy of `.contiguous` to avoid peak memory issues in internal models
Test Plan: CI
Differential Revision: D41724868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90425
Approved by: https://github.com/jianyuh
Continuation after https://github.com/pytorch/pytorch/pull/90163.
Here is a script I used to find all the non-existent arguments in the docstrings (the script can give false positives in the presence of *args/**kwargs or decorators):
_Edit:_
I've realized that the indentation is wrong for the last `break` in the script, so the script only gives output for a function if the first docstring argument is wrong. I'll create a separate PR if I find more issues with the corrected script.
``` python
import ast
import os
import docstring_parser
for root, dirs, files in os.walk('.'):
    for name in files:
        if root.startswith("./.git/") or root.startswith("./third_party/"):
            continue
        if name.endswith(".py"):
            full_name = os.path.join(root, name)
            with open(full_name, "r") as source:
                tree = ast.parse(source.read())
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef):
                    all_node_args = node.args.args
                    if node.args.vararg is not None:
                        all_node_args.append(node.args.vararg)
                    if node.args.kwarg is not None:
                        all_node_args.append(node.args.kwarg)
                    if node.args.posonlyargs is not None:
                        all_node_args.extend(node.args.posonlyargs)
                    if node.args.kwonlyargs is not None:
                        all_node_args.extend(node.args.kwonlyargs)
                    args = [a.arg for a in all_node_args]
                    docstring = docstring_parser.parse(ast.get_docstring(node))
                    doc_args = [a.arg_name for a in docstring.params]
                    clean_doc_args = []
                    for a in doc_args:
                        clean_a = ""
                        for c in a.split()[0]:
                            if c.isalnum() or c == '_':
                                clean_a += c
                        if clean_a:
                            clean_doc_args.append(clean_a)
                    doc_args = clean_doc_args
                    for a in doc_args:
                        if a not in args:
                            print(full_name, node.lineno, args, doc_args)
                        break
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90505
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
The old `temp_dir` is created under `PWD`. But `PWD` may not be writable and in general is not a good place to create temporary directories. Use the standard `tempfile` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89826
Approved by: https://github.com/soumith
Fixes#88074
Several datapipes have their lengths cached on being executed for the first time. However, source datapipes might change in length (most prominently, whenever `apply_sharding` is called). The behaviour is counter-intuitive because we do not expect `__len__` to have side-effects.
This PR makes `__len__` dynamically computed.
Changes:
- Add note to the `datapipes` README that `__len__` should be dynamic and why.
- Remove caching of length computations in `ConcaterIterDataPipe`, `MultiplexerIterDataPipe`, `ZipperIterDataPipe`, `BatcherIterDataPipe`, `ConcaterMapDataPipe`, and `BatcherMapDataPipe`.
- This required removal of the `length` attribute in setstate/getstate of `MultiplexerIterDataPipe`. I am unsure whether to remove this completely and risk breaking saved checkpoints (as I did) or whether to just ignore the `length` of the loaded `state`.
- This also means the classes above no longer have a `length` attribute. I have found no uses of this, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88302
Approved by: https://github.com/NivekT
Summary:
This pull request makes some LazyGraphExecutor private data structures protected such that XLAGraphExecutor can reuse them.
Here is the list:
1. DeviceLocker.
2. DeviceLockerArena.
3. DataCacheArena.
In addition, it also introduces LazyGraphExecutor::ResetTrimCounter() such that XLAGraphExecutor can reuse the trim counter.
Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90457
Approved by: https://github.com/JackCaoG
- Adds `log_level` to aot's config
- Outputs log to `<graph_name>_<log_level>.log` in aot_torchinductor subfolder of the debug directory
- Modifies the Inductor debug context to use the graph name when naming the folder instead of the os pid
- Adds `TORCH_COMPILE_DEBUG` flag to enable it, (as well as separate dynamo and inductor logs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88987
Approved by: https://github.com/Chillee
This PR changes the way masks for loads/stores are computed in triton backend of inductor.
New approach is to iterate over all variables used in indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when variable is created.
I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89566
Approved by: https://github.com/jansel, https://github.com/ngimel
Doing some tests with all Optimizer and LRScheduler classes in optim package, I noticed a couple of mistakes in type annotations, so created a pull request to fix them.
- In Optimizer class, incorrectly named parameter `default` instead of `defaults` in pyi file
- In SGD class, type for `maximize` and `differentiable` not available in either py or pyi files
I don't know if there is a plan to move all types from pyi to py files, so I wasn't too sure where to fix what.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90216
Approved by: https://github.com/janeyx99
Rewrite inplace addcdiv to a div, mul and inplace add to avoid graph break
Rewrite inplace add to a mul and inplace add to avoid graph break
Needed to close optimizer graph breaks
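For reference, a small numerical sketch of the equivalence behind the addcdiv rewrite (illustrative, not the dynamo implementation):
```python
import torch

param = torch.randn(4)
exp_avg = torch.randn(4)
denom = torch.rand(4) + 1.0
lr = 0.1

# In-place addcdiv: param += (-lr) * exp_avg / denom
expected = param.clone().addcdiv_(exp_avg, denom, value=-lr)

# Rewritten form: out-of-place div and mul, followed by a single in-place add
rewritten = param.clone()
rewritten.add_(exp_avg.div(denom).mul(-lr))

torch.testing.assert_close(expected, rewritten)
```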
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90330
Approved by: https://github.com/jansel
Summary: To get source for a particular module, the "correct" thing to do is to check the module's spec and use `get_source` if it's a SourceFileLoader, since subclasses may look elsewhere than the `__file__`, and the spec will give the source of truth. For torch packager, however, we prefer to use linecache, but the loader could still change the file, so we figure out the file for the module using the spec's loader rather than using `module.__file__`, if possible.
Test Plan: This code path will get exercised by CI. Also added a test for remapped files.
Differential Revision: D41412983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90258
Approved by: https://github.com/PaliC
Summary:
We should not fork in deploy when initializing torch.
Traceback (most recent call last):
File "<string>", line 38, in <module>
File "<string>", line 36, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/multipy/runtime/test_py.py", line 61, in <module>
import torch # has to be done serially otherwise things will segfault
File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/torch/__init__.py", line 158, in <module>
platform.system() != 'Windows':
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 891, in system
return uname().system
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 857, in uname
processor = _syscmd_uname('-p', '')
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 613, in _syscmd_uname
output = subprocess.check_output(('uname', option),
Test Plan: override a local script run to trigger init and set `subprocess.check_output` to None
Reviewed By: yinghai, houseroad
Differential Revision: D41848592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90492
Approved by: https://github.com/PaliC
Happy to split this PR more if it helps.
This PR adds functorch.grad support for autograd.Function. There's a lot
going on; here is the high level picture and there are more details as
comments in the code.
Mechanism (PyOperator)
- Somehow, autograd.Function needs to dispatch with functorch. This is
necessary because every layer of functorch needs to see the
autograd.Function; grad layers need to preserve the backward pass.
- The mechanism for this is via PyOperator. If functorch transforms are
active, then we wrap the autograd.Function in a `custom_function_call`
PyOperator where we are able to define various rules for functorch
transforms.
- `custom_function_call` has a rule for the functorch grad transform.
autograd.Function changes
- I needed to make some changes to autograd.Function to make this work.
- First, this PR splits autograd.Function into a _SingleLevelFunction
(that works with a single level of functorch transform) and
autograd.Function (which works with multiple levels). This is necessary
because functorch's grad rule needs some way of specifying a backward
pass for that level only.
- This PR changes autograd.Function's apply to either call
`custom_function_call` (if functorch is active) or super().apply (if
functorch isn't active).
Testing
- Most of this PR is just testing. It creates an autograd.Function
OpInfo database that then gets passed to the functorch grad-based tests
(grad, vjp, vjpvjp).
- Since functorch transform tests are autogenerated from OpInfo tests,
this is the easiest way to test various autograd.Function with
functorch.
Future
- jvp and vmap support coming next
- better error message (functorch only supports autograd.Function that
have the optional setup_context staticmethod)
- documentation to come when we remove the feature flag
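For context, a hedged sketch of the user-facing shape of an autograd.Function with the optional setup_context staticmethod being differentiated through functorch (at the time of this PR the behavior was gated behind a private feature flag, and the exact entry point may differ):
```python
import torch
from functorch import grad  # functorch entry point assumed here

class MySquare(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # With setup_context defined, forward no longer receives ctx.
        return x ** 2

    @staticmethod
    def setup_context(ctx, inputs, output):
        x, = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.tensor(3.0)
print(grad(MySquare.apply)(x))  # expected: tensor(6.)
```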
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89860
Approved by: https://github.com/soulitzer
Adds a setup_context staticmethod to autograd.Function.
If it exists, then the user splits the ctx-specific logic from the
forward() and puts it in the setup_context staticmethod.
Docs will come later when we remove the feature flag.
Test Plan:
- some light tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89859
Approved by: https://github.com/soulitzer
This PR adds a private runtime feature flag for the feature work we're going
to do with extending autograd.Function. The motivation of the feature flag
is:
- to guard the feature against unsuspecting users
- control the release of the feature to when we are ready to release it
We might not even need the feature flag (because we hope to have the
work done in the next month), but it is good practice and it does touch
currently public API (autograd.Function).
Concretely, "autograd.Function extension" refers to:
- adding an optional `setup_context` staticmethod to autograd.Function
- adding an optional `vmap` staticmethod to autograd.Function
- autograd.Function support for functorch
Test Plan:
- new test that the feature flag works
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89858
Approved by: https://github.com/soulitzer
Summary:
This pull request makes some tweaks to the LazyTensor class such that it's easier for XLATensor to inherit from it.
1. It replaces data_ptr() with data(), which now returns a const shared_ptr& type.
2. It adds a temporary ctor to LazyTensor::Data such that XLATensor::Data can easily inherit from it.
3. It moves LazyTensor(std::shared_ptr<Data>) and SetTensorData(at::Tensor) to protected for XLATensor to access.
Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90363
Approved by: https://github.com/JackCaoG
The function signature in its current state is ambiguous.
It's an inline function that is also declared to be imported from the DLL,
which leaves it subject to the compiler's decision to choose one or the other. Depending on what the compiler/linker chooses, we may get one of two behaviors for the `aten::init_num_threads` call:
1. Once-per-DLL-in-a-thread (if it's inlined)
2. Once-per-thread (if it's imported)
I suspect once-per-DLL-in-a-thread is already the case currently because the function is tagged inline,
so removing the inline simply makes the behavior a little more consistent and clear.
The function exists to avoid repeated calls to aten::init_num_threads.
Being in an "internal" namespace, the function isn't expected to be called by external plugins, which means that the "once-per-DLL-in-a-thread" behavior isn't much of a problem anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89511
Approved by: https://github.com/malfet
Adds 2 new hybrid sharding strategies to FSDP:
1. HYBRID_SHARD: applies ZeRO-3-style sharding within a node and data parallelism across nodes
2. HYBRID_SHARD_ZERO2: applies ZeRO-2-style sharding within a node and data parallelism across nodes
These are useful for medium-sized models and aim to decrease communication volume; tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.
Hybrid sharding in general works by sharding the model using a process group within a single node and creating inter-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, in which case we generate the process groups appropriately.
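A minimal usage sketch, assuming the new strategies are exposed on `torch.distributed.fsdp.ShardingStrategy` (the released enum names may differ, e.g. the ZeRO-2 variant may carry a leading underscore):
```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes the default process group is already initialized with one rank per GPU.
model = nn.Linear(1024, 1024).cuda()

fsdp_model = FSDP(
    model,
    # ZeRO-3-style sharding within a node, replication across nodes.
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    # process_group may alternatively be passed as a tuple of the intra-node
    # (sharding) and inter-node (replication) groups; with None, FSDP derives them.
)
```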
**Acknowledgements**
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
This makes the signature of `torch.masked.std` and `var` more consistent with the global namespace variant and also updates the sample inputs to repurpose the existing `sample_inputs_std_var` inputs which fully exercise the `correction` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87118
Approved by: https://github.com/cpuhrsch
I believe that @mrshenli used `ModuleWrapPolicy({UnitModule})` when applying `fully_shard` to `UnitModule`s because `policy=None` was not supported. However, he added that support in a previous PR, so this PR simplifies to using `policy=None` to make the intention more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90400
Approved by: https://github.com/mrshenli
- This PR introduces a new concept, the _communication module_ (denoted `comm_module`), that represents the module responsible for the unshard/reshard pair for a `FlatParamHandle`. This is well-defined because the current design assumes that each `FlatParamHandle` only has _one_ unshard/reshard pair for either the forward or backward pass.
- For the wrapper code path, the `comm_module` is exactly the module already being passed to the `FlatParamHandle` constructor.
- For the composable code path, the `comm_module` is not necessarily the module already being passed to the `FlatParamHandle`. This is because the module already being passed is always the local FSDP root module to give complete FQNs, instead of local FQNs. Distinguishing the communication module from the local FSDP root module can provide more flexibility for non-recursive wrapping designs in the future.
- This PR adds a unit test `test_unshard_reshard_order` that explicitly checks that `_unshard` and `_reshard` are called in exactly the same order across the two code paths.
- This PR does not fix `test_checkpoint_fsdp_submodules_use_reentrant`. However, the error message changes, so this PR accommodates that.
- The error is now the same as if we used the equivalent wrapper FSDP:
```
test_model.u1 = FSDP(test_model.u1, use_orig_params=True)
test_model.u2 = FSDP(test_model.u2, use_orig_params=True)
```
- The error is also the same as if we used wrapper FSDP with `use_orig_params=False`, so it is not unique to `use_orig_params=True`.
---
**`comm_module` Example**
```
model = Model(
seq1: nn.Sequential(
nn.Linear
nn.ReLU
nn.Linear
nn.ReLU
)
seq2: nn.Sequential(
nn.Linear
nn.ReLU
nn.Linear
nn.ReLU
)
)
policy = ModuleWrapPolicy({nn.Sequential})
fully_shard(model, policy=policy)
FullyShardedDataParallel(model, auto_wrap_policy=policy)
```
- This policy constructs two `FlatParamHandle`s, one for `seq1` and one for `seq2`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `model` as the `module` argument to every `FlatParamHandle`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90387
Approved by: https://github.com/mrshenli
Unlike for FSDP, where we already diverged to using per-test-file models, let us try to use the same set of models for the composable API effort. This can improve debugging efficiency because we know which module structures we support and which we do not _across all of our composable APIs_.
This PR had to perform some surgery for `test_materialize_meta_module`. Writing a correct parameter initialization function for meta device initialization is not easy, and we should revisit this. The old implementation, which followed the style of the previous unit tests--namely, using `module.to_empty()`--is actually incorrect for nested FSDP applications because `module.to_empty()` will re-initialize already-materialized parameters, and module materialization proceeds bottom-up. The existing unit test in `test_fsdp_meta.py` passes because it sets every parameter to ones (`self.weight.fill_(1)`), which is idempotent under re-initialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90386
Approved by: https://github.com/mrshenli
The `PositiveDefiniteTransform` is required to transform from an unconstrained space to positive definite matrices, e.g. to support testing the Wishart mode in #76690. It is a simple extension of the `LowerCholeskyTransform`.
I've also added a small test that ensures the generated data belong to the domain of the associated transform. Previously, the data generated for the inverse transform of the `LowerCholeskyTransform` wasn't part of the domain, and the test only passed because the comparison uses `equal_nan=True`.
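A short usage sketch, assuming the transform is exposed as `torch.distributions.transforms.PositiveDefiniteTransform` and builds a lower-Cholesky factor `L` before returning `L @ L.mT`:
```python
import torch
from torch.distributions.transforms import PositiveDefiniteTransform  # assumed import path

t = PositiveDefiniteTransform()

# An unconstrained square matrix is mapped to a symmetric positive definite one.
x = torch.randn(4, 4)
y = t(x)

assert torch.allclose(y, y.mT)  # symmetric
torch.linalg.cholesky(y)        # succeeds only for positive definite matrices

# The inverse maps back into the unconstrained space (the transform's domain).
x_unconstrained = t.inv(y)
```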
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76777
Approved by: https://github.com/lezcano, https://github.com/fritzo, https://github.com/soumith
Summary:
This PR implements `prune` in BaseStructuredSparsifier:
`prune` is a function that takes in a model with structured sparsity parametrizations (the result of `prepare`) and returns a resized model with the masked-out weights removed.
`prune` is defined by a mapping from **patterns** to different **pruning functions**.
- **patterns** are just sequences of operations, for example `(nn.Linear, activation, nn.Linear)`
- **pruning functions** are functions that take in a matched pattern as args and resize the appropriate layer sizes and weights.
```
def prune_linear_activation_linear(linear1, activation, linear2):
pass
```
- This is one line in the pattern config `(nn.Linear, activation, nn.Linear): prune_linear_activation_linear`
At a high level `prune` works by finding instances of the graph that match different patterns and then calling the mapped pruning functions on those matched patterns.
This is unlike the previous code which attempted to do both at the same time.
There may be some gaps in the patterns compared to the previous implementation, but the conversion functionality support should be the same.
Currently we have pruning functions for the following patterns:
- linear -> linear
- linear -> activation -> linear
- conv2d -> conv2d
- conv2d -> activation -> conv2d
- conv2d -> activation -> pool -> conv2d
- conv2d -> pool -> activation -> conv2d
- conv2d -> adaptive pool -> flatten -> linear
Added in MyPy type hints as well for the prune_functions.
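As an illustration of what a pruning function in such a mapping might do, here is a hedged sketch (not the PR's actual implementation) that shrinks a `linear -> activation -> linear` pattern given a boolean mask of output rows to keep on the first linear:
```python
import torch.nn as nn

def prune_linear_activation_linear(linear1, activation, linear2, keep_mask):
    # Illustrative only: drop the masked-out rows of linear1 and the
    # matching input columns of linear2, returning resized modules.
    kept = keep_mask.nonzero().flatten()

    # Shrink the output dimension of the first linear.
    new_linear1 = nn.Linear(linear1.in_features, len(kept), bias=linear1.bias is not None)
    new_linear1.weight.data = linear1.weight.data[kept]
    if linear1.bias is not None:
        new_linear1.bias.data = linear1.bias.data[kept]

    # Shrink the input dimension of the second linear to match.
    new_linear2 = nn.Linear(len(kept), linear2.out_features, bias=linear2.bias is not None)
    new_linear2.weight.data = linear2.weight.data[:, kept]
    if linear2.bias is not None:
        new_linear2.bias.data = linear2.bias.data

    return new_linear1, activation, new_linear2
```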
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89777
Approved by: https://github.com/vkuzo
Prior to this change, the symbolic_fn `layer_norm` (before ONNX version 17) always loses precision when eps is smaller than what the Float type can represent, while PyTorch always takes eps as Double. This PR adds `onnx::Cast` to the eps-related operations to prevent losing precision during the calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89869
Approved by: https://github.com/BowenBao
This is the last PR for integrating 2D into core distributed.
This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoint.
3. Add test_fsdp_optim_state.py as a unit test for No. 1.
4. Fix bug in torch/testing/_internal/distributed/checkpoint_utils.py
5. Rename the files for the APIs that should be private. Will organize and clean up further in the following PRs. #90328
Docstring and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol
Summary:
This will make sure we don't run into an internal assert for clang TSan, which has a cap of 63 on the concurrently held lock count.
It seems like it is failing with 64 since the comparison is `<`, so we are setting it to 63 here.
```
llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))"
```
Created from CodeHub with https://fburl.com/edit-in-codehub
Test Plan:
CI
Sandcastle run
Reviewed By: kimishpatel, salilsdesai
Differential Revision: D41444710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89453
Approved by: https://github.com/mcr229
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of
**`CUDA_KERNEL_ASSERT2`**
A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.
Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.
**`TORCH_DSA_KERNEL_ARGS`**
This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.
**`c10::cuda::get_global_cuda_kernel_launch_registry()`**
This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).
**`TORCH_DSA_KERNEL_LAUNCH`**
This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks to ensure the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.
**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, that no kernel assertions have occurred. If one has, it then raises an exception with:
1. Information (file, line number) of what kernel was launched.
2. Information (file, line number, message) about the device-side assertion
3. Information (file, line number) about where the failure was detected.
**Checking for device-side assertions**
Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors
Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
# Notes on special cases
* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two process are using the same GPU and one of the processes fails with a device-side assertion the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should be all be shown upon exit, but we've been unable to generate a test that produces this condition
Differential Revision: D37621532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
Summary:
In this logic, we are traversing the entries to find the module for STACK_GLOBAL entries.
According to 2837241f22/Lib/pickletools.py (L1799) we need to look for GET, BINGET and LONG_BINGET.
So this diff updates that. Also, while testing, I found some cases of empty modules, such as for tanh. For these, I added the option to skip processing.
Test Plan: Tested with f392778829
Differential Revision: D41748595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90223
Approved by: https://github.com/PaliC
The original implementation of cond() operator support in dynamo operated by recursively calling export() on the inner subgraph. This is problematic for a number of reasons:
* My original motivating reason: the original implementation had to play tricks to feed real tensors to the recursive export call, which means that it doesn't work well with tracing with dynamic shapes (where we MUST stay in fake tensors to accurately track dynamic shapes across the cond invocation)
* If there are pending side effects, the recursive export() call won't see those side effects (as they are only tracked by Dynamo, not actually applied to the Python environment.) You can see an example where dynamo cond tracing does the wrong thing at https://github.com/pytorch/pytorch/pull/90208
* If there were side effects inside the true/false branch, these side effects were silently lost (as the export only returns the graph of tensor operations, and not any of the residual Python bytecodes necessary to reapply any side effects.) This could have substantive effects on the export of subsequent parts of the model, as those parts of the models could rely on the side effects.
* It was not possible to track NN module accesses inside the true/false branches, necessitating a hack where the NN module was explicitly passed in as an input to cond https://github.com/pytorch/pytorch/pull/87020#issuecomment-1338842844 which doesn't really make any sense from a backend compilation perspective
* Guards induced from the inside of the true/false branch were not properly propagated to the top level guards; they were just silently dropped (in fact, the original implementation checked that the true/false branch produce the same guards which... is not useful? Like, I don't think that actually is even necessary for correctness)
This PR replaces the old implementation with a new implementation based on graphstate checkpointing. The basic idea is that to process a cond(), we checkpoint the state of our interpreter, run the true branch, roll back to our checkpoint, run the false branch, roll back to our checkpoint, and then merge the changes from both runs. I require the true/false branches to have exactly the same side effects, but union their guards.
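For context, a minimal sketch of how a cond() call looks from user code, assuming the experimental functorch control-flow entry point available at the time (the exact import path may differ):
```python
import torch
import torch._dynamo
from functorch.experimental.control_flow import cond  # assumed entry point

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

def f(x):
    # Both branches are captured as subgraphs; the predicate picks one at runtime.
    return cond(x.sum() > 0, true_fn, false_fn, [x])

opt_f = torch._dynamo.optimize("eager")(f)
print(opt_f(torch.randn(4)))
```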
Some of the details:
* Dynamo is too aggressive with tracking side effects when processing closures, c.f. https://github.com/pytorch/torchdynamo/pull/233/files#r1040480078 The basic problem is whenever I define a closure, this immediately counts as a side effect, even if I didn't actually mutate anything. This triggered on the nested cond export example. To prevent this from happening, I optimistically avoid tracking side effects, but if a STORE_DEREF happens, I restart analysis with the relevant Source.name() added to `mutated_closure_cell_contents` so we start tracking on closure allocation. This is enough to fix the relevant test.
* For the most part, I assert that the graph states must be equivalent after applying the true/false branches. During debugging, I found it useful to be able to compare two graph states and give a better description about what the divergence was. You can test this using the `diff()` method I've added to a few structures.
* The implementation now supports NestedUserFunctionVariable, which is nice as it allows the true/false branches to be defined closer to the cond implementation.
* I fixed the naming of the true/false subgraphs; previously they were named `name_0`, `name_1`, now they are named `cond_true_0` and `cond_false_0`
* I added `name_to_input` to the saved graph state. I don't actually know if this is necessary, but it seemed like a good idea.
* I have to play some tricks to get the speculating execution of the true/false branch to record into a subgraph. After a careful read of OutputGraph, I found that what would work is overriding graph with a fresh Graph that we want to write things into, and manually setting up the inputs/outputs. It's a little delicate as you have to make sure you reset the Graph to its original before you restore a checkpoint, as checkpoints don't actually save graph for efficiency, and just undo changes on the graph. This capability may usefully get refactored to OutputGraph but I didn't do it in this PR for simplicity.
There are some further problems with the cond() implementation that I leave for future work. Most of these were preexisting with the original implementation.
* Not a problem per se, but if an NN module is used by both the true/false branch, it will show up in the final graph twice (since it has to be a submodule of the GraphModule that makes use of it.) I hope the export pipeline can deal with this.
* List of tensor output for cond is not supported.
* The true/false return values may not have consistent sizes/dims/etc, and we don't check them for consistency.
* If we modify fake tensors in the true/false branches, we aren't rolling them back, c.f. https://github.com/pytorch/torchdynamo/issues/1840
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90286
Approved by: https://github.com/voznesenskym
Get rid of std::iterator inheritance/references for `c10::DictIterator`, `c10::IListRefIterator` and `c10::ListIterator`
Followup after https://github.com/pytorch/pytorch/pull/90174
Fixes deprecation warnings and extension compilation failures when using VC++,
which raises the following errors:
```
C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\include\ATen/core/IListRef.h(517): error C4996: 'std::iterator<std::bidirectional_iterator_tag,T,ptrdiff_t,T *,T &>::value_type': warning STL4015: The std::iterator class template (used as a base class to provide typedefs) is deprecated in C++17. (The <iterator> header is NOT deprecated.) The C++ Standard has never required user-defined iterators to derive from std::iterator. To fix this warning, stop deriving from std::iterator and start providing publicly accessible typedefs named iterator_category, value_type, difference_type, pointer, and reference. Note that value_type is required to be non-const, even for constant iterators. You can define _SILENCE_CXX17_ITERATOR_BASE_CLASS_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.
C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\include\ATen/core/List.h(169): error C4996: 'std::iterator<std::random_access_iterator_tag,T,ptrdiff_t,T *,T &>::difference_type': warning STL4015: The std::iterator class template (used as a base class to provide typedefs) is deprecated in C++17. (The <iterator> header is NOT deprecated.) The C++ Standard has never required user-defined iterators to derive from std::iterator. To fix this warning, stop deriving from std::iterator and start providing publicly accessible typedefs named iterator_category, value_type, difference_type, pointer, and reference. Note that value_type is required to be non-const, even for constant iterators. You can define _SILENCE_CXX17_ITERATOR_BASE_CLASS_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.
```
Discovered while working on https://github.com/pytorch/pytorch/pull/85969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90379
Approved by: https://github.com/ezyang, https://github.com/dagitses
Continuation of https://github.com/pytorch/pytorch/pull/88207
A compile-time guard was preventing ActivityType::CUDA from being available on ROCm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time, so operators were being charged GPU time for both the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) CUDA times in, e.g., table().
Previously a cmake variable was not being propagated to a '-D', causing an issue on Windows, which uses cuda but not cupti.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89785
Approved by: https://github.com/jeffdaily, https://github.com/malfet
Fixes https://github.com/pytorch/torchdynamo/issues/1959, #90260
However, I wasn't able to make the existing stride tests fail before the fix, even though I'm comparing all strides, not just the significant ones.
Separately running refs on meta tensors produces wrong strides as shown in #90260; however, it looks like the meta tests use some other way of computing meta info. I've been running
```
pytest -s -v test/test_meta.py -k test_meta_outplace_expand_cuda_float64
```
and verified that it has a sample input that should fail and that it indeed compares all the strides, but the produced `meta_rs` results somehow still had correct strides.
Edit: @SherlockNoMad helped me figure out how to fail the tests, and now I've set the correct ops for checking. `expand` fails for some test inputs because it special-cases 0-dim input case, correctly modeling it in prims would require a lot of changes, so skipping that for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90341
Approved by: https://github.com/SherlockNoMad
Summary: cuda:: is an ambiguous namespace. Make it explicitly c10::cuda
Differential Revision: D41469007
```
/caffe2/caffe2/core/context_gpu.cu(564): error: "caffe2::cuda" is ambiguous
/caffe2/caffe2/core/context_gpu.cu(564): error: expected a ";"
/caffe2/caffe2/core/context_gpu.cu(568): warning #12-D: parsing restarts here after previous syntax error
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/caffe2/caffe2/core/context_gpu.cu(569): error: "caffe2::cuda" is ambiguous
/caffe2/caffe2/core/context_gpu.cu(628): error: "caffe2::cuda" is ambiguous
4 errors detected in the compilation of "/caffe2/caffe2/core/context_gpu.cu".
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89534
Approved by: https://github.com/malfet
Summary:
- Registered vulkan_prepack::create_qconv2d_context to the QuantizedCPU backend.
- Registered vulkan_prepack::run_qconv2d_context to the Vulkan backend.
- Added function test_quantized_conv2d, in order to test Vulkan Quantized Conv2d with QUInt8 activation, weight and bias (all QUInt8).
- Added multiple tests for vulkan quantized conv2d (regular, depthwise and pointwise). All these tests make use of the test_quantized_conv2d function.
This function tests the correctness of vulkan quantized conv2d, by comparing the following two processes:
(we start with randomly generated float cpu tensors)
- random float cpu tensors -> to vulkan -> quantize them -> apply vulkan conv2d quantized op -> dequantize result -> to cpu
- random float cpu tensors -> quantize them -> dequantize -> apply cpu floating point conv2d op on dequantized tensors -> quantize result -> dequantize
This function takes three boolean flags that modify its behavior:
- prepacking:
- if false, then we directly call at::native::vulkan::ops::quantized_conv2d
- if true, then we call vulkan_prepack::create_qconv2d_context and vulkan_prepack::run_qconv2d_context.
- compute_quantization_params & random_quantization_params:
- if both are false, all quantization params are fixed (given as input)
- if compute_quantization_params is true, all params are computed
- if random_quantization_params is true, the input params are random and the output params are computed.
(compute_quantization_params takes precedence over random_quantization_params)
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: SS-JIA
Differential Revision: D41047096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90012
Approved by: https://github.com/salilsdesai
Summary:
Copying QInt8 and QInt32 from cpu to vulkan:
- Added shader nchw_to_image_int8
- Added shader nchw_to_image_int32
Copying QInt8 and QInt32 from vulkan to cpu
Note: This functionality is currently disabled until issues on Android are resolved.
- Added shader image_to_nchw_int32
- QInt8 works with the same existing image_to_nchw_quantized shaders
Added multiple tests for each supported dtype:
- cpu_to_vulkan_and_dequantize:
These tests check the correctness of copying quantized cpu tensor to vulkan by comparing the output of the following:
- cpu float tensor -> quantize -> to vulkan -> dequantize -> to cpu
- cpu float tensor -> quantize -> dequantize
- cpu_to_vulkan_and_vulkan_to_cpu
(currently disabled until copying vulkan quantized to cpu is enabled):
These tests check the correctness of copying from cpu to vulkan and from vulkan to cpu by creating a random cpu float tensor, quantizing it, then copying it to vulkan, then back to cpu and comparing the output tensor to the original quantized tensor.
- quantize_per_tensor_and_vulkan_to_cpu
(currently disabled until copying vulkan quantized to cpu is enabled):
These tests check the correctness of copying quantized tensor from vulkan to cpu by comparing the output of the following:
- cpu float tensor -> to vulkan -> quantize -> to cpu
- cpu float tensor -> quantize
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: kimishpatel
Differential Revision: D41654287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90357
Approved by: https://github.com/SS-JIA
**Motivation:**
Add a helper to map from the FQN to the corresponding flat_param. The helper will directly get flat_param from fsdp_state and flat_handler as flat_param is not registered to the module if `use_orig_params` is True.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89899
Approved by: https://github.com/awgu
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.
This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
This PR introduces a new function we can pass to torch._dynamo.optimize - guard_failure_fn. Usage is in this PR and the one stacked on top of it, but the gist is that it emits failed-guard reason strings alongside the code. This is useful for tests and debugging, as it gives far finer-grained assertions and control than the compile counter alone.
This is a resubmit of https://github.com/pytorch/pytorch/pull/90129
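A hedged sketch of how such a hook might be used in a test; the keyword name (`guard_fail_fn`) and the shape of the failure object are assumptions here, not confirmed by this description:
```python
import torch
import torch._dynamo as dynamo

failure_reasons = []

def on_guard_failure(failure):
    # Collect the failed-guard reason strings for later assertions.
    failure_reasons.append(str(failure))

@dynamo.optimize("eager", guard_fail_fn=on_guard_failure)
def fn(x, n):
    return x + n

fn(torch.randn(4), 1)
fn(torch.randn(4), 2)  # changing the int fails a guard and triggers a recompile
print(failure_reasons)
```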
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90371
Approved by: https://github.com/ezyang
Summary: This commit moves helper functions that are not core
to the convert logic out of convert.py, which was more than
1000 lines. This helps with readability since a new developer
won't have to scroll through hundreds of lines of util functions
to understand the core logic. There should be no change in
functionality in this commit.
BC-breaking notes: The following helper functions that were
previously exposed under the `torch.ao.quantization.fx.convert`
namespace are now made private. Many of these are moved to the
new convert_utils.py
```
convert_custom_module
convert_standalone_module
convert_weighted_module
get_module_path_and_prefix,
has_none_qconfig,
insert_dequantize_node,
is_conversion_supported,
maybe_recursive_remove_dequantize,
replace_observer_or_dequant_stub_with_dequantize_node,
restore_state,
run_weight_observers,
```
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90189
Approved by: https://github.com/jerryzh168
Summary:
importing torch.fb seemed like a good idea, but we don't always have
torch.fb inside fbcode. Testing for torch.version.git_version is more
reliable, since we'll never have a git_version inside fbcode, which is an hg
repo.
Test Plan: `buck2 run mode/dev-nosan //caffe2/test/inductor:smoke`
Reviewed By: soumith, jansel
Differential Revision: D41777058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90312
Approved by: https://github.com/soumith
Currently the stride and offset are determined by substituting 1 and 0 for
different indices, which will fail for any expression that doesn't match the
expected stride calculation. Instead, this uses `sympy.match` and returns `None`
for any variables used in non-standard index calculation, e.g. `torch.roll`
which uses `ModularIndexing`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90184
Approved by: https://github.com/jansel
Currently there is `test_vertical_fusion1` which fuses entirely during
the lowering stage and no buffers are realized. This adds
`test_scheduler_vertical_fusion1` which is the same test but with
several intermediate calculations realized so the scheduler is left
to do the fusion.
To support the test, this PR also adds:
- `metrics.ir_nodes_pre_fusion` which when compared with
`generated_kernel_count` tells us how many nodes were fused.
- `torch._test_inductor_realize` which is an identity operator in
eager, but under inductor also forces the input to be realized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90014
Approved by: https://github.com/jansel
Summary:
Before this diff, copying of vulkan quantized tensors to cpu was broken. This was mainly caused because the shader only works properly with specific global and local work group sizes, and those specific sizes had been modified in earlier refactoring.
As part of this fix, an optimized version of the shader that performs the copying was written, to take advantage of the special case when the plane size (x*y) is a multiple of 4.
After fixing this, and writing comprehensive tests, it was discovered that the copying still has issues on Android for specific input sizes, e.g. [1, 1, 11, 17]. These issues are currently unresolved, so, copying of quantized vulkan tensors to cpu has been disabled.
What is contained in this diff?
- Fix for existing issue
- New optimized shader (image_to_nchw_quantized_mul4)
- New comprehensive tests (which have been disabled)
- Disable the copying of quantized vulkan tensors to cpu until issues on Android are fixed.
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: kimishpatel
Differential Revision: D41047098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90275
Approved by: https://github.com/kimishpatel
Summary: One common cause of jit unscriptability issues is the loss of node type annotations on local names after one or several FX transforms. One way to improve the type coverage is to eagerly annotate the type of `getitem` nodes from their parent sequence node. This diff introduces an fx pass to do that.
Test Plan:
```
buck2 test //caffe2/test:fx_experimental
```
Reviewed By: xush6528
Differential Revision: D41749744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90237
Approved by: https://github.com/xush6528
The old code didn't actually fakeify traceable tensor subclasses at the
time they are added as a GraphArg to the module; now we do, by ignoring
the subclass during fakeification and relying on Dynamo to simulate
the subclass on top. See comments for more details.
BTW, this codepath is super broken, see filed issues linked on the
inside.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90009
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
A recent PR deprecated `torch.testing.assert_allclose` in favor of `torch.testing.assert_close` and left a `TODO`. This PR follows up to confirm that we do intend to have `check_dtype=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90251
Approved by: https://github.com/rohan-varma
`FSDP.clip_grad_norm_()` is tested separately in `test_fsdp_clip_grad_norm.py`. This PR removes the dead non-run code from `common_fsdp.py` and `test_fsdp_core.py` related to `clip_grad_norm_()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90250
Approved by: https://github.com/rohan-varma
Observed by @aazzolini: some ops might have Optional[Tensor] returns
where they return None (e.g. native_layer_norm_backward). It's a mismatch
between the C++ aten op signature and Python None, but we need to handle it
on the Python side.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90241
Approved by: https://github.com/aazzolini
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `linear-leaky_relu` op for the `onednn` backend, which will be used for int8 inference with the `onednn` backend. This op cannot be called with other quantization backends; otherwise an error is thrown.
**Test Plan**
python test_quantization.py TestQuantizedLinear
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88478
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
- Add `full` nvprim to support factory functions because the full reference uses `empty` and `fill` while we have a full factory function.
- Change `full_like` reference to call `full` to avoid defining another nvprim.
- Enable support for new_zeros to enable `cudnn_batch_norm` decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89230
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
I need to rebase later after Shen's PRs land.
The idea is to only register the pre/post-forward hook on the _root modules_ among the modules that consume a `FlatParameter`. (Yes, the term _root module_ is heavily overloaded. We may want to clarify that at some point. Here, _root_ is being used in the graph sense, meaning parent-less, and the scope is only among the modules consuming a `FlatParameter`.)
This avoids unnecessary pre/post-forward hooks running, which would lead to errors because the unshard is not truly idempotent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90201
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
This PR gets rid of torchgen FunctionSchema parsing and parses the schema
manually. It should resolve the torchgen packaging issue and also
provide some perf wins when running DTensor eagerly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90106
Approved by: https://github.com/awgu
In PyTorch, the optimizer state_dict always uses numbers to index the optimizer state for parameters.
Now the composability workstream needs an FQN-based way to index the optimizer state for parameters.
For example, the SGD optimizer might have something in its `state_dict` like:
```
{'state':
{0:
{'momentum_buffer': tensor(...)},
{1:
{'momentum_buffer': tensor(...)},
...
}
'param_groups':
[{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```
And in NamedOptimizer we want the `state_dict` to be:
```
{'state':
{'net1.0.weight':
{'momentum_buffer': tensor(...)},
{'net1.0.bias':
{'momentum_buffer': tensor(...)},
...
}
'param_groups':
[{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```
We also want to support load_state_dict to enable optim `state_dict` overrides for NamedOptimizer.
For the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (like registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with apply_optim_in_backward.
3. Also upstream `CombinedOptimizer`.
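To illustrate only the FQN re-keying described above (not the NamedOptimizer implementation itself), a sketch of converting a standard index-keyed optimizer state_dict into an FQN-keyed one:
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
optim = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Populate some optimizer state; the standard state_dict indexes params by int.
model(torch.randn(2, 4)).sum().backward()
optim.step()
sd = optim.state_dict()

# Re-key state and param_groups by fully qualified name (illustrative only).
fqns = [name for name, _ in model.named_parameters()]
fqn_state = {fqns[idx]: value for idx, value in sd["state"].items()}
fqn_groups = [
    {**group, "params": [fqns[idx] for idx in group["params"]]}
    for group in sd["param_groups"]
]
named_sd = {"state": fqn_state, "param_groups": fqn_groups}
```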
Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
Not sure why, but a top-level `using namespace` directive causes VC++ to fail with the following (if the C++17 standard is used, but everything is fine with C++14):
```
C:\actions-runner\_work\pytorch\pytorch\third_party\pybind11\include\pybind11\detail\../pytypes.h(1520): error C2872: 'attr': ambiguous symbol
C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/core/interned_strings.h(349): note: could be 'c10::attr'
C:\actions-runner\_work\pytorch\pytorch\torch/csrc/jit/ir/ir.h(75): note: or 'torch::jit::attr'
C:\actions-runner\_work\pytorch\pytorch\cmake\..\third_party\pybind11\include\pybind11/pybind11.h(1094): note: see reference to function template instantiation 'pybind11::str pybind11::str::format<_Ty1&>(_Ty1 &) const' being compiled
with
[
_Ty1=pybind11::handle
]
```
Solve this by replacing global `using namespace torch::jit;` with
specific usages of objects/methods from namespaces
Another prep change for https://github.com/pytorch/pytorch/70188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90228
Approved by: https://github.com/kit1980, https://github.com/albanD
When not using an ordered dictionary, the parameter values can end up in a different order for each specialization. This can produce shader names that are inconsistent, both in their naming and in the meaning of the template parameter values that appear in their names.
For example, if you have:
conv2d_pw:
default_values:
- X: 1
- Y: 2
parameter_values:
- Y: 3
The default parameter values can generate a shader named 'my_shader_1x2', where 1x2 stands for the X and Y parameters respectively. Then, for the non-default values, of which there is only one, we have Y=3, and with the existing implementation you can end up generating a shader named 'my_shader_3x1', where 3 is for Y and 1 is for X. This leads to confusing shader names.
This diff fixes this by
1. using an ordered dict.
2. updating non-default values by first copying the default values and then updating them.
Differential Revision: [D41006639](https://our.internmc.facebook.com/intern/diff/D41006639/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89951
Approved by: https://github.com/salilsdesai
Fixes https://github.com/pytorch/torchdynamo/issues/1928
For `ModularIndexing` we generate indexing code with `//` and `%` operators. When `ModularIndexing` base is negative (that can happen after valid simplifications), `//` in triton produces wrong results https://github.com/openai/triton/issues/619/. For `//` op coming from pytorch, we have codegen workarounds, but I'm reluctant to apply these workarounds to very common indexing computation patterns, both for code readability and perf considerations.
Similarly, we replace `ModularIndexing` with `IndexingDiv` when we can prove that base is small, but those assumptions break when `ModularIndexing` base is negative (`ModularIndexing` is always positive, `IndexingDiv` isn't).
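For background on why a negative base is problematic, a small illustration of floor division (Python/sympy semantics) versus C-style truncating division, which is what integer division may lower to on the GPU side (an illustrative assumption, not a statement about triton internals):
```python
# Python's // floors toward negative infinity.
assert -7 // 4 == -2
assert -7 % 4 == 1

# C-style integer division truncates toward zero instead,
# which would give -7 / 4 == -1 and a remainder of -3.
def trunc_div(a, b):
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

assert trunc_div(-7, 4) == -1
```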
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89933
Approved by: https://github.com/jansel
… as equivalent replacements for std::is_pod and std::is_pod_v because they are deprecated in C++20.
When consuming libtorch header files in a project that uses C++20, there are warnings about std::is_pod being deprecated. This patch fixes that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88918
Approved by: https://github.com/ezyang
Fixes: https://github.com/pytorch/data/issues/865
I will add another PR in torchdata to validate that this change solves the infinite datapipe problem (I have tested locally). This is one of the most annoying stacks of PRs caused by the separation between TorchData and PyTorch.
There is a case where `file.close` is never called because the generator function never reaches the end. A simple example would be `zip`ping two datapipes of different lengths. The longer DataPipe never reaches the end of its generator and is then cleaned up by `gc`, so the `file.close` line is not executed. (This is the reason that Vitaly had to create this [hack](4451eb24e6/torch/utils/data/datapipes/iter/combining.py (L573-L583)) to retrieve all remaining data to make sure the generator function is fully executed.)
However, this hack introduces another problem where an infinite datapipe would make `zip` never end, as it would try to deplete the infinite iterator. See: https://github.com/pytorch/data/issues/865
So, in this PR, I am adding a `try-finally` clause to make sure `file.close` is always executed during the destruction of the `generator` object. Then, we don't need the hack within `zip` any more.
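A hedged sketch of the resulting pattern (illustrative, not the exact datapipe code; `data.txt` is a placeholder created for the demo): wrapping the yield loop in `try`/`finally` so the file handle closes even if the generator is garbage-collected before exhaustion.
```python
with open("data.txt", "w") as f:  # placeholder file for the demo
    f.write("a\nb\nc\n")

def read_lines(path):
    file = open(path)
    try:
        for line in file:
            yield line
    finally:
        # Runs on normal exhaustion and also when the generator is closed
        # early (e.g. garbage-collected after a partially consumed zip).
        file.close()

it = read_lines("data.txt")
next(it)    # consume only part of the generator
it.close()  # or `del it`; either way the finally block closes the file
```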
Differential Revision: [D41699469](https://our.internmc.facebook.com/intern/diff/D41699469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89974
Approved by: https://github.com/NivekT, https://github.com/wenleix
```
[130/1102] Building CXX object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cudnn/LossCTC.cpp.o
/home/gaoxiang/nvfuser5/aten/src/ATen/native/cudnn/LossCTC.cpp:97:11: warning: use of bitwise '&' with boolean operands [-Wbitwise-instead-of-logical]
(target_lengths[b] < 256) & (target_lengths[b] <= input_lengths[b]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&&
/home/gaoxiang/nvfuser5/aten/src/ATen/native/cudnn/LossCTC.cpp:97:11: note: cast one or both operands to int to silence this warning
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90131
Approved by: https://github.com/kit1980
The documentation of `torch.rand` was missing the `generator` keyword argument in the function signature. However, the argument is explained in the documentation and `torch.rand` accepts that argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90071
Approved by: https://github.com/janeyx99
Summary:
This diff is reverting D41682843
D41682843 has been identified to be causing the following test or build failures:
Tests affected:
- https://www.internalfb.com/intern/test/281475048939643/
Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1444954
Here are the tasks that are relevant to this breakage:
T93770103: 5 tests started failing for oncall assistant_multimodal in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.
Test Plan: NA
Reviewed By: zyan0, atuljangra, YazhiGao
Differential Revision: D41710749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90132
Approved by: https://github.com/awgu
There is a case where `file.close` is never called because the generator function never reaches the end. A simple example would be `zip`ping two datapipes of different lengths. The longer DataPipe never reaches the end of its generator and is then cleaned up by `gc`, so the `file.close` line is not executed. (This is the reason that Vitaly had to create this [hack](4451eb24e6/torch/utils/data/datapipes/iter/combining.py (L573-L583)) to retrieve all remaining data to make sure the generator function is fully executed.)
However, this hack introduces another problem where an infinite datapipe would make `zip` never end, as it would try to deplete the infinite iterator. See: https://github.com/pytorch/data/issues/865
So, in this PR, I am adding a `try-finally` clause to make sure `file.close` is always executed during the destruction of the `generator` object. Then, we don't need the hack within `zip` any more.
Differential Revision: [D41699470](https://our.internmc.facebook.com/intern/diff/D41699470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89973
Approved by: https://github.com/NivekT
This will be the last disruptive functorch internals change.
Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.
Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times
Test Plan:
- wait for tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90091
Approved by: https://github.com/anijain2305, https://github.com/ezyang
Summary:
This PR aligns the "eager" mode of the structured pruning flow with the existing unstructured pruning flow.
The base pruner has been moved and renamed from BasePruner to BaseStructuredPruner:
`torch/ao/pruning/_experimental/pruner/base_pruner.py -> /torch/ao/pruning/_experimental/pruner/base_structured_pruner.py`
Support for pruning batchnorm modules in the config has been removed, so now the structured pruning code can use more of the BaseSparsifier logic and we don't need to override as many functions.
Since we aim to only support a single flow, we have only updated ZeroesParametrizations (FakeStructuredSparsity) and BiasHook.
The parameterizations have also been rewritten to use a bool mask tensor for keeping track of pruned rows, instead of using sets before.
This better aligns structured and unstructured sparsity.
The BaseStructuredSparsifier tests have also been updated to reflect the above changes. I also removed `squash_mask` tests because they were breaking CI and `squash_mask` is no longer used.
We will migrate the structured pruning code out of this folder in a later PR.
Test Plan:
```
python test/test_ao_sparsity -- TestBaseStructuredPruner
```
Reviewers:
z-a-f vkuzo
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88436
Approved by: https://github.com/vkuzo
Summary: Previously we explicitly set a qconfig for ops
like conv and linear in the default QConfigMapping. However,
this makes it difficult for the user to override the global and
have the new global take effect for basic ops. This commit
removes these explicit settings so the user can simply run
the following to quantize these ops.
```
qconfig_mapping = get_default_qconfig_mapping()
qconfig_mapping.set_global(my_qconfig)
```
There is no change in behavior for the default use case
of not setting anything on the default QConfigMapping.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_default_qconfig_mapping_override_global
Reviewers: vkuzo, jerryzh168
Subscribers: vkuzo, jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90066
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
For PyTorch FSDP, the only way that gradients are in low precision is if `keep_low_precision_grads=True` or if the user turns on AMP. This PR adds tests for the former and improves the documentation for `clip_grad_norm_()`, especially around these non-full-precision cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90028
Approved by: https://github.com/rohan-varma
For any `flat_param.data = flat_param.to(...)` or `flat_param.grad.data = flat_param.grad.to(...)`, we must also refresh sharded parameter/gradient views, respectively, if the storage changes.
For `keep_low_precision_grads=True` and a sharded strategy, we cast the gradient back to the low precision using `.data` to bypass the PyTorch check that a parameter and its gradient have the same dtype. For `use_orig_params=True` before this PR, the gradient would incorrectly still be in full precision, not low precision, since we did not refresh views (this can actually be considered a memory leak since we have two copies of the gradient now, one in low precision and one in full precision). This PR refreshes the views.
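For illustration, this is the kind of `.data` bypass being referred to (a standalone sketch, not FSDP code):
```python
import torch

p = torch.nn.Parameter(torch.randn(4))
p.grad = torch.randn(4)

# p.grad = p.grad.to(torch.bfloat16)     # would be rejected by the param/grad dtype check mentioned above
p.grad.data = p.grad.to(torch.bfloat16)  # assigning through .data bypasses that check
```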
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90027
Approved by: https://github.com/mrshenli
Fix errors from [7k github models](https://github.com/pytorch/torchdynamo/issues/1884)
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1062, in get_fake_value
return wrap_fake_exception(
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 739, in wrap_fake_exception
return fn()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1063, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1112, in run_node
raise RuntimeError(
RuntimeError: Failed running call_function <function einsum at 0x7fd8f246a4c0>(*('i,j->ij', FakeTensor(FakeTensor(..., device='meta', size=(4,)), cpu), FakeTensor(FakeTensor(..., device='meta', size=(2,)), cuda:0)), **{}):
Unhandled FakeTensor Device Propagation for aten.mul.Tensor, found two different devices cpu, cuda:0
(scroll up for backtrace)
```
The root cause is that `tensor.type()` should return `torch.cuda.FloatTensor` rather than `torch.FloatTensor` when the tensor is on GPU.
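For reference, the distinction looks like this (requires a CUDA build):
```python
import torch

torch.randn(2).type()                  # 'torch.FloatTensor'
torch.randn(2, device="cuda").type()   # 'torch.cuda.FloatTensor'
```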
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90021
Approved by: https://github.com/jansel
This commit was landed inconsistently between the internal repo and the merged PR. That caused merge conflicts, which required reverting in both places, normalizing the internal commit stack, and then re-landing properly.
Original commit: #88384 (011452a2a1c745d4b12f83f89eca039f482d134b)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f22804be6909e54fc09e07f891ab0886774
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
- Add the graph index to the profile information of Inductor kernels for better debuggability.
The generated code for different graphs can produce kernels with the same name. The side effect is that it is hard to identify each kernel's share of E2E performance, because the profiler aggregates results by kernel name regardless of which graph the kernel came from. Hence, this PR adds the graph index to the profile information to address this limitation.
- Label arbitrary code ranges for `eager` and `opt` modes for better debuggability.
The profile information of the dynamo benchmarks mixes the eager mode and opt mode, which makes it hard to separate the ranges of the two modes. This PR adds eager and opt markers to the profile information to address this limitation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
Motivation: TIMM models commonly use a custom-defined BN that calls F.batch_norm: https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/layers/norm_act.py#L26, and the fx graph will look like:
```
opcode         name                     target                                    args                                                                                                        kwargs
-------------  -----------------------  ----------------------------------------  ----------------------------------------------------------------------------------------------------------  --------
placeholder    x                        x                                         ()                                                                                                          {}
call_module    self_conv                self_conv                                 (x,)                                                                                                        {}
get_attr       self_bn_running_mean_1   self_bn_running_mean                      ()                                                                                                          {}
get_attr       self_bn_running_var      self_bn_running_var                       ()                                                                                                          {}
get_attr       self_bn_weight           self_bn_weight                            ()                                                                                                          {}
get_attr       self_bn_bias             self_bn_bias                              ()                                                                                                          {}
call_function  batch_norm               <function batch_norm at 0x7f07196cdf70>   (self_conv, self_bn_running_mean_1, self_bn_running_var, self_bn_weight, self_bn_bias, False, 0.1, 1e-05)  {}
call_module    self_bn_drop             self_bn_drop                              (batch_norm,)                                                                                               {}
```
The original conv+bn folding path doesn't work for **F.batch_norm**. However, in the **F.batch_norm** case, if its parameters are constant (attributes of the module that will not be updated), we can also apply the const-folding optimization. This PR enables it and will improve the TIMM models' performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89746
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
This PR adds support for matching patterns that have multiple arguments; it's needed for quantization in the PyTorch 2.0 early prototype.
Before this PR, we only supported patterns like:
```
x -> conv -> bn -> relu
(relu, (bn, conv))
```
where each operator takes a single (node) argument; the code breaks when we want to match a pattern containing an op with multiple arguments, such as:
```
               shape \
transpose -> reshape -> output
```
where `reshape` has two arguments
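As a concrete illustration (not code from this PR), a function whose traced graph contains a `reshape` node with two arguments:
```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    shape = (x.shape[0], -1)
    y = torch.transpose(x, 0, 1)
    # reshape takes two args here: the transposed tensor and the target shape
    return torch.reshape(y, shape)

# e.g. torch.fx.symbolic_trace(f) on a 2-D input yields a reshape call_function
# node whose args include both the transpose output and the shape.
```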
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_match_pattern_with_multiple_args
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89986
Approved by: https://github.com/vkuzo
Summary:
The deleter of the operator's unique_ptr doesn't get called unless the unique_ptr is created after the op has been created
This fixes the problem reported in
https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/
Test Plan:
# Testing memory leak fix
**With test code added in D41487340:**
```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```
Before this diff:
```
==2060866==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 608 byte(s) in 1 object(s) allocated from:
#0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
#1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77
Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
#0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
#1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85
SUMMARY- AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```
After this diff:
- No errors
___
# Testing op correctness
```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```
Passes
- https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/
Differential Revision: D41487341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89544
Approved by: https://github.com/mcr229
Summary: Added randomized test for quantized add, sub, mul and div
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Differential Revision: D41047094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89578
Approved by: https://github.com/digantdesai
`torch.compile` can be used either as a decorator or to optimize a model directly, for example:
```
@torch.compile
def foo(x):
return torch.sin(x) + x.max()
```
or
```
mod = torch.nn.ReLU()
optimized_mod = torch.compile(mod, mode="max-autotune")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89607
Approved by: https://github.com/soumith
Summary: This commit renames fx/quantization_patterns.py
to fx/quantize_handler.py, and fx/fusion_patterns.py to
fx/fuse_handler.py. This is because these files contain
only QuantizeHandler and FuseHandler respectively, so the
new names are more descriptive. A future commit will
further break BC by removing all the empty *QuantizeHandler
classes.
BC-breaking notes:
The following classes under the
`torch.ao.quantization.fx.quantization_patterns` namespace
are migrated to the `torch.ao.quantization.fx.quantize_handler`
namespace:
```
QuantizeHandler
BinaryOpQuantizeHandler
CatQuantizeHandler
ConvReluQuantizeHandler
LinearReLUQuantizeHandler
BatchNormQuantizeHandler
EmbeddingQuantizeHandler
RNNDynamicQuantizeHandler
DefaultNodeQuantizeHandler
FixedQParamsOpQuantizeHandler
CopyNodeQuantizeHandler
GeneralTensorShapeOpQuantizeHandler
CustomModuleQuantizeHandler
StandaloneModuleQuantizeHandler
```
The following classes under the
`torch.ao.quantization.fx.fusion_patterns` namespace are
migrated to the `torch.ao.quantization.fx.fuse_handler`
namespace:
```
DefaultFuseHandler
FuseHandler
```
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89872
Approved by: https://github.com/jerryzh168
The ignored modules are still using the original precision, which
leads to the following error.
```
RuntimeError: mat1 and mat2 must have the same dtype
```
This is not blocking me at the moment, but the fix seems not too
hard. We can add a pre-forward hook to each ignored module to
convert activations to original precision, and a post-forward hook
to convert it back to the specified precision.
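A rough sketch of that idea (hypothetical helper names, not actual FSDP code), assuming the ignored module's parameters define its original dtype:
```python
import torch
import torch.nn as nn

def _cast_floats(obj, dtype):
    # recursively cast floating-point tensors in (possibly nested) args/outputs
    if torch.is_tensor(obj) and obj.is_floating_point():
        return obj.to(dtype)
    if isinstance(obj, (list, tuple)):
        return type(obj)(_cast_floats(o, dtype) for o in obj)
    return obj

def attach_dtype_hooks(ignored: nn.Module, low_prec_dtype: torch.dtype) -> None:
    orig_dtype = next(ignored.parameters()).dtype

    # cast incoming activations back to the ignored module's original precision
    ignored.register_forward_pre_hook(
        lambda mod, args: _cast_floats(args, orig_dtype)
    )
    # cast outputs back to the low-precision dtype used by the rest of the model
    ignored.register_forward_hook(
        lambda mod, args, output: _cast_floats(output, low_prec_dtype)
    )
```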
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89971
Approved by: https://github.com/awgu
We may need to express guards on the size/stride/storage offset of
a tensor, but we cannot do this if it's already been duck sized.
This PR guarantees that we allocate a symbol (or negation of the
symbol) whenever we ask to create a SymInt, and propagates this
symbol to SymNode so that Dynamo can look at it (not in this PR).
This PR doesn't actually add guards, nor does Dynamo do anything
with these symbols.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89879
Approved by: https://github.com/albanD
This set tracks symbols which we know are definitely not 0/1, and thus
can be further simplified when we try to work out their static value
without guards. Right now, all allocated symbols are in this set,
but we will later add symbols which don't uphold this.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89871
Approved by: https://github.com/albanD
See code comment for details. I also had to do some extra fixes:
* `run_functionalized_fw_and_collect_metadata` is now able to handle duplicated arguments
* `aot_wrapper_dedupe` now always returns boxed compiled functions
* `aot_wrapper_dedupe` is now applied to inference compiler along with autograd compiler (preexisting)
Fixes https://github.com/pytorch/torchdynamo/issues/1939
Fixes DebertaV2ForQuestionAnswering DebertaForMaskedLM DebertaForQuestionAnswering DebertaV2ForMaskedLM
Repro command:
```
python benchmarks/dynamo/huggingface.py --performance --float32 -dcuda --training --inductor --no-skip --dashboard --only DebertaForQuestionAnswering --cold_start_latency
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89896
Approved by: https://github.com/bdhirsh
This was separated out from the previous PR to decouple. Since not all builds include `torch.distributed`, we should define the globals in the dynamo file and import to distributed instead of vice versa. Unlike the version from the previous PR, this PR prefixes the globals with `_` to future proof against `_dynamo/` eventually becoming public.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89913
Approved by: https://github.com/wconstab
Preparation for the next PR in this stack: #89559.
I replaced
- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).
There were a few instances where the result of `torch.equal` was used directly. In those cases I replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
After our failed attempt to remove `assert_allclose` in #87974, we decided to add it to the documentation after all. Although we drop the expected removal date, the function continues to be deprecated in favor of `assert_close`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89526
Approved by: https://github.com/mruberry
I've tried to soft-enforce this manually already, albeit with a line length of 120. This just adds it to the CI. Note that this only applies to `torch/testing/*.py` and thus everything under `torch/testing/_internal/**/*` is *not* affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89525
Approved by: https://github.com/kit1980
When you are writing a meta function, you cannot call item() on the tensor because there is no real data on the tensor and it will fail. The error message was not very good in this case, see also https://github.com/pytorch/pytorch/issues/89959
This PR takes a brute force approach to resolving the problem: just manually define meta implementations for the naughty functions that are calling item(). However, this results in a lot of code duplication. The easiest way to avoid this situation is to rewrite the decomps so they don't call item. It should not be that difficult to use direct tensors on your operations, as scalar tensors can broadcast too.
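As an illustration of the suggested rewrite (not code from this PR):
```python
import torch

def scale_with_item(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    return x * s.item()  # fails under meta/fake tensors: there is no real data to read

def scale_without_item(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    return x * s         # a 0-d tensor broadcasts, so no .item() call is needed
```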
I could only test this with `buck test @mode/opt -c python.package_style=inplace //executorch/backends/test:test_backends` in internal with D41555454. Test coverage needs to be improved, otherwise don't blame us when we break you.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89958
Approved by: https://github.com/jerryzh168
1. If the user uses AMP to run bfloat16 models, `torch.autocast` will
keep module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while the input/output are in bfloat16.
2. If the user explicitly casts the model to bfloat16, such as:
```
x = torch.randn(n, t, c).bfloat16()
ln = nn.LayerNorm(c).bfloat16()
y = ln(x)
```
The input/output and gamma/beta will all be in bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81851
Approved by: https://github.com/ezyang
Summary: Helper functions for producing random inputs/scale/zero points and also computing suitable scale and zero points of a tensor, used in the testing of quantized ops.
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: kimishpatel
Differential Revision: D41595034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89922
Approved by: https://github.com/digantdesai
Differential Revision: [D40351062](https://our.internmc.facebook.com/intern/diff/D40351062)
For the mkldnn quantization path, we do weight prepacking using dummy data to query the expected weight format. The packed weight's format may differ from the one needed for the real input (the weight format depends on the input's shape), in which case a block-weight-to-block-weight reorder is performed. MKL-DNN may hit the following issue when doing such a reorder (tested on an ICX machine):
```
test_conv_reorder_issue_onednn
torch.ops.quantized.conv2d(qx, w_packed, output_scale=1.0, output_zero_point=0)
File "/home/weiwen/.conda/envs/int8-dev/lib/python3.9/site-packages/torch/_ops.py", line 472, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: could not create a primitive descriptor for a reorder primitive
```
This PR fixes it: if the block-weight-to-block-weight reorder fails, we reorder the block weight to a plain weight first, and then reorder the plain weight to the target block weight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86876
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
To fix https://github.com/pytorch/pytorch/issues/77507
Originally, `utils::RowwiseMoments<BFloat16>` still accumulated in BFloat16,
which is not only slow but also introduces additional rounding errors.
This patch does the accumulation in float for bfloat16 inputs:
each bfloat16 vec (size 16) is converted to two float vecs (size 8)
and accumulated into the m1 (mean) and m2 (rstd) vecs, which are all float vecs.
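A small standalone demonstration of the rounding-error point (not the kernel code itself):
```python
import torch

x = torch.full((8192,), 0.1, dtype=torch.bfloat16)

acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
for v in x:
    acc_bf16 = acc_bf16 + v   # the running sum is rounded to bf16 after every step

acc_fp32 = x.float().sum()    # accumulate in float32, as this patch does

print(acc_bf16.item())  # stalls far below the true sum of roughly 820
print(acc_fp32.item())  # ~820
```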
No effect on float performance, will improve bfloat16 performance:
* avx512 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
```
* avx512 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
```
* avx2 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
```
* avx2 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84405
Approved by: https://github.com/jgong5
This PR attemps to fix the following pyre error:
```
Incompatible parameter type [6]: In call
`dist.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel.__init__`,
for 7th parameter `auto_wrap_policy` expected
`Optional[typing.Callable[..., typing.Any]]` but got
`Optional[_FSDPPolicy]`.
```
Besides, this also removes the type inconsistency in code and docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89930
Approved by: https://github.com/awgu
Add dynamo smoke tests to CI, which checks for python/torch/cuda versions and runs simple dynamo examples on a few backends, including inductor. Smoke tests will run on dynamo and inductor shards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89302
Approved by: https://github.com/malfet
Summary:
When `ref_node.args` is empty, QAT will throw an index-out-of-range error. Here is an example: line 574 uses `tensors = ....` in the torch.cat call, which is treated as `kwargs`.
{F800357376}
f388506954
To fix the issue, we use the value of the first kwarg if args is empty.
Test Plan: f388545532
Reviewed By: bigning, lyoka
Differential Revision: D41396771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89778
Approved by: https://github.com/lyoka, https://github.com/houseroad
Previously, `assert_functionalization` only accepted functions with a single Tensor parameter. This PR beefs up the check to allow functions that take multiple parameters.
This PR also changes the test_instance_norm test to check that the multiparam change works.
## Test plan
Locally tested, CI should also pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89798
Approved by: https://github.com/samdow
# Motivation
We need to add an XPU backend to support torch.save and torch.load when the parameter _use_new_zipfile_serialization=False.
# Solution
We design this by wrapping the data as a tensor:
>1. use an in-place copy for H2D, and
>2. directly call tensor.to() for D2H.
This can help us:
>1. unify the generic code for all backends.
>2. support all the non-CPU device backends.
# Additional Context
No additional UT is needed;
test/test_serialization.py will cover this code change.
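A minimal sketch of the H2D/D2H idea above (illustrative helpers, not the actual serialization code), assuming `dev` is an arbitrary non-CPU device:
```python
import torch

def device_to_host(t: torch.Tensor) -> torch.Tensor:
    # D2H: directly call tensor.to() on the wrapped data
    return t.to("cpu")

def host_to_device(cpu_t: torch.Tensor, dev: torch.device) -> torch.Tensor:
    # H2D: allocate on the target device and fill it with an in-place copy
    out = torch.empty_like(cpu_t, device=dev)
    out.copy_(cpu_t)
    return out
```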
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89679
Approved by: https://github.com/ezyang
I made a pass over Linjian's `_symbolic_trace.py` and tidied it up a bit. Aside from simple stylistic changes, this PR makes the following changes:
- Save `visited_params: Set[nn.Parameter]` to avoid linear overhead to check a parameter already being visited when appending to the parameter execution order list (`param_forward_order`)
- Move the tracer patching logic to a class `_ExecOrderTracer` to have a reference to `self.exec_info` without having a fragmented 2-step initialization (like the old `_init_execution_info(root_module)` plus `_patch_tracer(tracer, root_module, execution_info)`)
- Define `_ParamUsageInfo` to formalize the `Tuple[nn.Module, List[str, nn.Parameter]]` elements being mapped to in the execution info `dict`, and clarify the documentation regarding what this represents
- Change the unit test to use `TestCase`, not `FSDPTest`, to avoid initializing a process group
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89917
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
This PR makes two minor changes: It (1) moves the recently-added module annotation logic for dynamo support to a separate file `torch/distributed/fsdp/_dynamo_utils.py` and ~~(2) saves the annotated attribute names to global variables `FSDP_MANAGED_MODULE` and `FSDP_USE_ORIG_PARAMS`~~.
Update: Since the distributed package may not be included in some builds, it is not safe to import from `torch.distributed...` to a file in `_dynamo/`. I will not include change (2) in this PR. The alternative is to define those globals (privately) in the dynamo file and import from there in the FSDP file.
- The first change is mainly a personal choice, where I wanted to avoid the dynamo explanation from dominating the FSDP constructor space-wise. I added the `(see function for details)` to the inline comment to forward interested readers.
- The second change follows the custom we have taken in the past for such attributes (e.g. `FSDP_FLATTENED`). My understanding (in the past as well as currently) is that this is a good practice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89890
Approved by: https://github.com/wconstab
This PR extends the `Tensor.to_sparse()` method to `Tensor.to_sparse(layout=None, blocksize=None)` in a BC manner (`layout=None` means `layout=torch.sparse_coo`).
In addition, the PR adds support for the following conversions:
- non-hybrid/hybrid COO tensor to CSR or CSC or a COO tensor
- short, bool, byte, char, bfloat16, int, long, half CSR tensor to a BSR tensor
and fixes the following conversions:
- hybrid COO to COO tensor
- non-batch/batch hybrid BSR to BSR or BSC tensor
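For example (illustrative usage of the extended API):
```python
import torch

x = torch.eye(4)
coo = x.to_sparse()                         # unchanged behavior: layout=None means COO
csr = x.to_sparse(layout=torch.sparse_csr)  # dense -> CSR
bsr = csr.to_sparse(layout=torch.sparse_bsr, blocksize=(2, 2))  # CSR -> BSR
```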
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89502
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
I am in the habit now to run `ufmt format test/distributed/fsdp` before committing, and this changed `test_fsdp_checkpoint.py`. I separated this into its own PR. This change should be safe to force merge to save CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89916
Approved by: https://github.com/mrshenli
Moving to train mode for TIMM models and also raising batch size for accuracy testing.
Raising batch size seems to remove a lot of noise/instability coming from batch_norm decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780
Approved by: https://github.com/ngimel
This PR includes:
Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding a MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
Tests:
```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```
test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.
[T134844615]
## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
Originally, `cpu/moments_utils.h` used the namespace at::native::utils.
This file contains `Vectorized<>`; to make it properly vectorized
on different archs, it needs to use an anonymous namespace or an inline namespace,
otherwise it would be linked against the scalar version of the code.
This PR fixes the vectorization issue in `RowwiseMoments`, which is used to calculate `mean` and `rstd` in norm layers.
Benchmark data is attached; generally fp32 gets a 2-3x speedup and bf16 an even larger one.
This patch improves layer_norm (input size 32x128x1024) float32 inference:
* avx512 single socket: 2.1x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.439 ms; bf16: 2.479 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
```
* avx512 single core: 3.2x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 6.308 ms; bf16: 39.765 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
```
* avx2 single socket: 2.3x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 1.248 ms; bf16: 8.487 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
```
* avx2 single core: 2.5x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 10.792 ms; bf16: 66.366 ms
after: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
```
Attached some original VTune profiling results here to further indicate the issue:
1. original bottlenecks

we can see `RowwiseMomentsImpl<>` takes majority of the runtime here.
2. Instruction level breakdown of `RowwiseMomentsImpl<>`

we can see it's all **scalar** instructions here.
3. after the fix, the bottlenecks

getting better.
4. after the fix, Instruction level breakdown of `RowwiseMomentsImpl<>`

now it is all **vectorized** instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84404
Approved by: https://github.com/jgong5
To reuse memory when allocating the unsharded `FlatParameter` in the unshard stream, we only need to block the CPU thread on the preceding free event (i.e. `event.synchronize()`) before allocating the unsharded memory, which happens in `handle.unshard()`. Notably, this can be done after the pre-unshard logic, which at most performs _sharded_ allocations (low precision shard or H2D sharded `FlatParameter` copy) in its own pre-unshard stream. This enables the pre-unshard to overlap with any pending ops.
With this change, I believe that we should use `limit_all_gathers=True` all the time to stay true to FSDP's proposed memory semantics.
If a user wants to set `limit_all_gathers=False`, that would mean that he/she wants to overlap ops that are issued after the unshard logic's all-gather with ops that are pending at the time when FSDP _would_ block the CPU thread via `event.synchronize()`.
- If the user is willing to not reuse memory for that all-gather, then the user may as well have applied `NO_SHARD` and optionally ZeRO-1 (if this niche is important, then maybe we should consider hardening ZeRO-1). This is because now the unsharded memory for the all-gather additionally contributes to peak memory since it cannot reuse memory.
- If the user wanted to reuse memory for that all-gather, then we needed to block the CPU thread. There is no way around that given the caching allocator semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89057
Approved by: https://github.com/mrshenli
We do not need to have the pre-unshard and unshard streams wait for the computation stream because we are not using the pre-unshard or unshard streams in `clip_grad_norm_()`.
The other change is simply avoiding a loop to get `grads`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89308
Approved by: https://github.com/mrshenli
**Overview**
This PR removes an outdated TODO:
```
# TODO (awgu): When exposing the original parameters, we need to also
# use this attribute to prevent re-synchronizing parameters.
```
**Justification**
We only pass `managed_params` to `_sync_module_params_and_buffers()`, where `managed_params` is defined as
```
managed_params = list(_get_orig_params(root_module, state._ignored_params))
```
This `_get_orig_params()` call excludes parameters already flattened by FSDP. Thus, `_sync_module_params_and_buffers()` will not re-sync already-synchronized parameters. Each parameter appears in `managed_params` for some FSDP instance exactly once and hence is only synchronized once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89217
Approved by: https://github.com/mrshenli
Improve the test_nested_dict.py test:
1. Add comments to show flatten_dict and mapping result.
2. Update the test_mapping unit test to ensure key-value pairs match in the mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89854
Approved by: https://github.com/H-Huang
Summary: `ExprGroup::getMergeCandidates()` had a logic bug. The vector being initialized had its arguments mis-ordered. This didn't trigger a build warning because the warning about implicit cast from an integral type to `bool` wasn't enabled.
Test Plan: `buck test fbsource//arvr/mode/win/vs2019/cuda11/opt fbsource//arvr/mode/hybrid_execution //arvr/libraries/neural_net_inference/TorchScript/...`
Differential Revision: D41488939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89551
Approved by: https://github.com/davidberard98, https://github.com/jjsjann123
Summary:
This is to make sure the description text wraps around the code instead of being displayed as a single line.
Test Plan:
visual inspections
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89795
Approved by: https://github.com/andrewor14
The issue for `test_2d_parallel.py` is that `DTensor` does not support the idiom `param.data = view` where `view` is a `DTensor`. To work around this, we do not preserve the parameter variable `param` and instead create a new parameter variable altogether via `nn.Parameter(view)`. Preserving the parameter variable when unsharded was not a strict requirement -- it just made sense to do that if we are already doing that when _sharded_, where it _is_ a strict requirement to support the optimizer step. The sharded case is not an issue for 2D because sharded implies local tensor, not `DTensor`.
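A sketch of the workaround (the function and variable names are illustrative, and `name` is assumed to be a direct attribute of `module`):
```python
import torch
import torch.nn as nn

def swap_in_view(module: nn.Module, name: str, view: torch.Tensor) -> None:
    # Previously: getattr(module, name).data = view  (fails when `view` is a DTensor)
    # Now: drop the old parameter variable and wrap the view in a fresh nn.Parameter
    module.register_parameter(name, nn.Parameter(view))
```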
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89845
Approved by: https://github.com/zhaojuanmao
Avoids
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
File "foo.py", line 3, in <module>
a = torch.cuda.Stream()
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
Summary: The recommended way to use QConfigMapping is through
`get_default_qconfig_mapping`. However, the docs still references
usages that use `QConfigMapping().set_global(...)`. This doesn't
actually work well in practice when the model has fixed qparams
ops for example. This commit updates these usages.
Reviewers: vkuzo
Subscribers: vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87299
Approved by: https://github.com/jerryzh168
Summary: Previously under torch/ao/quantization we have
backend_config/utils.py and fx/backend_config_utils.py, which
was confusing. This commit deletes the latter and moves
everything there to more suitable util files.
BC-breaking note: The following public APIs under the
`torch.ao.quantization.fx.backend_config_utils` namespace
are removed in this commit.
```
get_quantize_handler_cls
get_fusion_pattern_to_fuse_handler_cls
get_native_quant_patterns
get_pattern_to_quantize_handlers
```
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89810
Approved by: https://github.com/jerryzh168
This reverts commit faa032c5e58502de6ea461e531109d2acc22e56a.
Reverted https://github.com/pytorch/pytorch/pull/89694 on behalf of https://github.com/clee2000 due to broke periodic b/c they take ~2.5 hrs, also broke mem leak check b/c its slow, should probably look into having this be a parameter
This assert was accidentally made stricter when transitioning from per-FSDP-instance training state to per-handle training state. This PR relaxes it again, which should restore compatibility for some reentrant AC plus FSDP cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89791
Approved by: https://github.com/zhaojuanmao
Summary: The example in the BackendConfig docstring and the README
was not runnable. This fixes a typo (`bias_type` -> `bias_dtype`),
removes the call to an internal helper function, and adds an
additional BackendPatternConfig to make the example BackendConfig
more realistic and useful.
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89319
Approved by: https://github.com/jerryzh168
This will be the last disruptive functorch internals change.
Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.
Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times
Test Plan:
- wait for tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88756
Approved by: https://github.com/ezyang
Update the previous recursive logic:
continue setting the training attribute only if the slot is an object and a module.
For the corresponding JIT module, we get the module list first and set the modules one by one; there is a method to get all modules iteratively instead of recursively.
This change patches one fix to set the training attribute for `model_f269583363.ptl`. Another patch is needed, because the current lite interpreter doesn't have the correct type when loading an object with setstate.
Differential Revision: [D41466417](https://our.internmc.facebook.com/intern/diff/D41466417/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89488
Approved by: https://github.com/iseeyuan
During accuracy minification, the minifier can create graphs which cause assertion failures. This PR catches such assertions and lets the minifier move on instead of getting stuck minifying this issue.
It is possible that such graphs point to some real, although unrelated, issue, so the assertion is printed to flag it for debugging if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89739
Approved by: https://github.com/mlazos
copy_graphstate is called a ton; this change makes it a lot faster and helps with https://github.com/pytorch/torchdynamo/issues/1803
We tag each graph node with a timestamp, store the timestamp when checkpointing, and, when restoring, remove nodes newer than the timestamp stored in the state. This has essentially the same behavior as the original implementation, it just doesn't copy the whole graph.
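A simplified sketch of that idea (illustrative data structures, not dynamo's actual graph classes):
```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Node:
    op: Any
    timestamp: int

class Graph:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.timestamp: int = 0

    def create_node(self, op: Any) -> Node:
        node = Node(op, self.timestamp)
        self.nodes.append(node)
        return node

    def checkpoint(self) -> int:
        # O(1): remember the current timestamp instead of deep-copying the graph
        saved = self.timestamp
        self.timestamp += 1
        return saved

    def restore(self, saved: int) -> None:
        # drop every node created after the checkpoint was taken
        self.nodes = [n for n in self.nodes if n.timestamp <= saved]
```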
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89232
Approved by: https://github.com/jansel
Adding a memory_tracker API to show operator-level memory traces for the allocated_memory, active_memory and reserved_memory stats; it also gives a summary of the top 20 operators that allocate the most memory.
The implementation mainly uses TorchDispatchMode and module hooks to get traces and add markers.
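A rough sketch of the core mechanism (not the actual memory_tracker implementation; class name and bookkeeping are illustrative):
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class MemoryTraceMode(TorchDispatchMode):
    """Record the allocated CUDA memory observed after each dispatched op."""

    def __init__(self):
        super().__init__()
        self.traces = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        if torch.cuda.is_available():
            self.traces.append((str(func), torch.cuda.memory_allocated()))
        return out

# usage sketch:
# with MemoryTraceMode() as mode:
#     model(inputs)
# print(mode.traces[:20])
```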
Follow-up PRs will:
1. allow tracing more than 1 iteration
2. dump json data for visualization
3. add unit test for DDP training
4. add unit test for FSDP training
5. add unit test for activation checkpointing + DDP/FSDP training
6. add traces for activation memories and top operators that generate activation memories
7. print summaries for more breakdowns like model size, optimizer states, etc
8. add traces for temporary memories or memories consumed by cuda streams or nccl library if possible
9. connect the tool with OOM memory debugging
10. add dynamic programming (dp) algorithm to find best activation checkpointing locations based on the operator level activation memory traces
11. add same traces & dp algorithm for module level memory stats, as FSDP wrapping depends on module level memories, for some model users/not model authors, if they have to apply activation checkpointing on module level, they need module level memory traces as well
======================================================
Current test result for the memory_tracker_example.py on notebook:
Top 20 ops that generate memory are:
bn1.forward.cudnn_batch_norm.default_0: 98.0009765625MB
maxpool.forward.max_pool2d_with_indices.default_0: 74.5MB
layer1.0.conv1.backward.max_pool2d_with_indices_backward.default_0: 49.0MB
layer1.0.bn1.forward.cudnn_batch_norm.default_1: 24.5009765625MB
layer1.0.bn2.forward.cudnn_batch_norm.default_2: 24.5009765625MB
layer1.1.bn1.forward.cudnn_batch_norm.default_3: 24.5009765625MB
layer1.1.bn2.forward.cudnn_batch_norm.default_4: 24.5009765625MB
layer1.2.bn1.forward.cudnn_batch_norm.default_5: 24.5009765625MB
layer1.2.bn2.forward.cudnn_batch_norm.default_6: 24.5009765625MB
layer1.0.conv1.forward.convolution.default_1: 24.5MB
layer1.0.conv2.forward.convolution.default_2: 24.5MB
layer1.1.conv1.forward.convolution.default_3: 24.5MB
layer1.1.conv2.forward.convolution.default_4: 24.5MB
layer1.2.conv1.forward.convolution.default_5: 24.5MB
layer1.2.conv2.forward.convolution.default_6: 24.5MB
maxpool.backward.threshold_backward.default_32: 23.5MB
layer2.0.downsample.backward.convolution_backward.default_26: 12.2802734375MB
layer2.0.bn1.forward.cudnn_batch_norm.default_7: 12.2509765625MB
layer2.0.bn2.forward.cudnn_batch_norm.default_8: 12.2509765625MB
layer2.0.downsample.1.forward.cudnn_batch_norm.default_9: 12.2509765625MB
<img width="1079" alt="Screen Shot 2022-11-10 at 10 03 06 AM" src="https://user-images.githubusercontent.com/48731194/201172577-ddfb769c-fb0f-4962-80df-92456b77903e.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88825
Approved by: https://github.com/awgu
When combining FSDP with reentrant checkpointing, the post-backward
hook might run twice and then hit [this
error](e20ec44544/torch/distributed/fsdp/_runtime_utils.py (L487)).
This is because reentrant backward uses nested autograd GraphTasks.
The inner GraphTask is not aware of the outer one and therefore
will flush pending `AccumulateGrad` invocations on exit, which in
turn triggers the post backward hooks registered by FSDP. Later,
the outer GraphTask will trigger that again, leading to the above
error.
PR #89791 relaxes the FSDP training state check, but we still run
into grad value check failures occasionally. Therefore, this PR only
lands the test for non-reentrant test, and we can enable the
reentrant test when the accuracy issues are addressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89781
Approved by: https://github.com/rohan-varma
- This is a strict requirement given the way dynamo+FSDP is implemented,
but isn't convenient to assert.
- By plumbing use_orig_param field on all wrapped modules, we can
do this assertion inside dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89523
Approved by: https://github.com/awgu
Performance benchmarks on 6 popular models from 1-64 GPUs compiled with
torchinductor show performance gains or parity with eager, and showed
regressions without DDPOptimizer. *Note: resnet50 with small batch size shows a regression with optimizer, in part due to failing to compile one subgraph due to input mutation, which will be fixed.
(hf_Bert, hf_T5_large, hf_T5, hf_GPT2_large, timm_vision_transformer, resnet50)
Correctness checks are implemented in CI (test_dynamo_distributed.py),
via single-gpu benchmark scripts iterating over many models
(benchmarks/dynamo/torchbench.py/timm_models.py/huggingface.py),
and via (multi-gpu benchmark scripts in torchbench)[https://github.com/pytorch/benchmark/tree/main/userbenchmark/ddp_experiments].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88523
Approved by: https://github.com/davidberard98
### Summary
Making dynamo treat the nn.Modules inside FSDP wrappers as 'Unspecialized'
results in dynamo-produced graphs where nn.module parameters are inputs
to the graph rather than attributes of the outer graphmodule.
This helps in FSDP since it forces dynamo to pick the latest copy
of the parameters off the user's nn.Module (which FSDP mutates every pre_forward),
solving the ordering issue in backward.
### Details
Imagine this toy model
```
class MyModule(torch.nn.Module):
def __init__(self, a, b):
super(MyModule, self).__init__()
self.net = nn.Sequential(
nn.Linear(a, b),
nn.ReLU(),
)
def forward(self, x):
return self.net(x)
class ToyModel(nn.Module):
def __init__(self):
super(ToyModel, self).__init__()
self.net = nn.Sequential(
*[MyModule(10, 10000)]
+ [MyModule(10000, 1000)]
+ [MyModule(1000, 5)]
)
def forward(self, x):
return self.net(x)
```
Where FSDP is recursively wrapped around each `MyModule`, then dynamo-compiled, with dynamo already configured to skip/break in FSDP code. You'd expect to get 3 compiled AOT functions, corresponding to the contents of `MyModule`, and then see FSDP's communication ops happen inbetween them (eagerly). This almost happens (everything works out fine in forward), but in backward there is an ordering issue.
FSDP creates a flat buffer for all the parameters that are bucketed together, and then creates views into this buffer to replace the original parameters. On each iteration of forward, it creates a new view after 'filling' the flatbuffer with data from an all-gather operation, to 'unshard' the parameters from remote devices. Dynamo traces the first such view and stores it in a compiled graphmodule.
During tracing, we see (1) view created for first MyModule, (2) compile first MyModule, (3) ... for the rest of layers
Then during runtime, we see (A) view created for first MyModule (and orphaned), (B) execute first compiled MyModule, using old view, ...
This is a problem, because we want backward hooks to run right after each compiled-backward, but autograd executes those hooks in an order mirroring their execution order during forward. Since we are forever using the views created during steps (1, 3, .. N), which all happen before the steps (A, B, ...), this means that all the hooks will happen after all the compiled backwards. An illustration of the problem - a torchviz graph showing the 2 possible orderings of autograd, and a profile showing the view-backwards ops happening after all the compiled backwards, and before all the backward hooks.
<img width="2069" alt="image" src="https://user-images.githubusercontent.com/4984825/202828002-32dbbd15-8fc3-4281-93e9-227ab5e32683.png">
<img width="2069" alt="image" src="https://user-images.githubusercontent.com/4984825/202828632-33e40729-9a7f-4e68-9ce1-571e3a8dd2dd.png">
A solution is to make dynamo not specialize on these nn modules. It is worth pointing out that this nn.module specialization is de-facto failing, as we are modifying .parameters and this bypasses dynamo's __setattr__ monkeypatch, which should have automatically kicked us out to Unspecialized and forced a recompile.
After unspecializing, the new views (created during steps A, C, ...) are actually _used_ at runtime by the module, making their creation order interleaved, making autograd execute their backwards interleaved.
The new torchviz graph (this time with names added for the view tensors):
<img width="2043" alt="image" src="https://user-images.githubusercontent.com/4984825/202828480-d30005ba-0d20-45d8-b647-30b7ff5e91d3.png">
And a new profile showing the interleaving of compiled backwards and hooks, allowing overlapping of reduce-scatter.
<img width="2293" alt="image" src="https://user-images.githubusercontent.com/4984825/202828533-bb20a041-19b8-499c-b3cf-02808933df47.png">
@jansel @davidberard98 @aazzolini @mrshenli @awgu @ezyang @soumith @voznesenskym @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89330
Approved by: https://github.com/davidberard98
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
The summary stat diff was reporting the diff between the previous day and the day before that, instead of between today and the previous day. The issue was that summary stats were not uploaded to the archive before the summary stat differ ran.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89789
Approved by: https://github.com/anijain2305
This ensures that all elements of `FlatParameter._params` and `FlatParameter._shared_params` are `nn.Parameter`s (as expected). This was violated by the local tensor of a `DTensor` when using 2D parallelism. To fix the breakage, we simply wrap with `nn.Parameter` if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89782
Approved by: https://github.com/fduwjj
This PR moves nested_dict and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.
This provides the functionality to flatten a nested dict and unflatten a flattened dict.
Docstring will be added in the following PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89537
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
This PR moves nested_tensors to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.
This flattens sharded tensors in state_dict. It is used when saving and loading FSDP SHARDED_STATE_DICT.
Docstring, individual and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89501
Approved by: https://github.com/wanchaol
Fixes#87894
This PR adds a warning if captured graph is empty (consists of zero nodes).
The example snippet where would it be useful:
```python
import torch
x = torch.randn(10)
z = torch.zeros(10)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
z = x * x
# Warn user
```
and in #87894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754
Approved by: https://github.com/ezyang
Summary: Fixed bug on pack_biases, where the weight scale and zero point were being assigned to the bias.
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: SS-JIA
Differential Revision: D41350358
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89568
Approved by: https://github.com/salilsdesai
When using SciPy >= 1.7, wishart_log_prob runs into singular samples, which means there are `inf`s in `batched_prop` and `unbatched_prop`.
The difference of two `inf`s is `nan`, which will fail the `equal(0)` check.
However, passing the tensors directly to `assertEqual` is not only supported but the correct way, as it handles `inf` values etc.
Change the same code in 2 more tests:
- test_multivariate_normal_log_prob
- test_lowrank_multivariate_normal_log_prob
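For reference, a tiny demonstration of the `inf - inf` failure mode described above and how the testing utilities handle it:
```python
import torch

a = torch.tensor([float("inf"), 1.0])
b = torch.tensor([float("inf"), 1.0])

print((a - b).eq(0))              # tensor([False,  True]): inf - inf is nan, so the check fails
torch.testing.assert_close(a, b)  # passes: matching infs are treated as equal
```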
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87977
Approved by: https://github.com/soulitzer
Disabling GradScaler because:
1) The benchmark setup runs only 2 iterations of fwd-bwd, so it is not useful.
2) The current setup shares the grad_scaler between the eager and dynamo models,
which is bad because GradScaler has state and can adjust the scaling
factor between the eager and dynamo runs, making the accuracy check
harder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89741
Approved by: https://github.com/ngimel
Summary:
https://github.com/pytorch/pytorch/pull/89122 introduces internal compatibility issues with torchdeploy. However, GetPythonFramesFunction() never worked with torchdeploy, so this PR simply reverts to the original behavior of skipping the function if torchdeploy is used as a forward fix.
Test Plan:
Running failed tests in T128123281
```
buck2 test @//mode/opt //multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TaggingRace' --run-disabled
buck2 test mode/dev //multipy/runtime/testdev:test_deploy_from_python -- --exact 'multipy/runtime/testdev:test_deploy_from_python - multipy.runtime.testdev.test_deploy_from_python.TestDeployFromPython: test_deploy_from_python'
```
Differential Revision: D41414263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89315
Approved by: https://github.com/kurman
Strategy taken from voz's #89392 but my implementation strategy
is a bit different.
If a fake tensor is provided, we use its FakeTensorMode
(and more importantly, its ShapeEnv--this is what is tested
in the new unit test). Only one tensor needs to be fake;
if nothing is fake we just make a fresh mode as before.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89670
Approved by: https://github.com/voznesenskym
This is extracted from voz's #89392
Previously, the implementation did some half-assed caching where it
returned a callable, that when invoked for the first time, actually
performed the compilation. Delaying the compilation like this...
seems totally unnecessary? To make matters worse, it has a cost
(we have to check if we hit the cache) and is unsound (because the
compiled function may not be valid for other arguments).
So instead, we ask user to provide arguments, and compile everything
immediately.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89669
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
There was a lot of strangeness in how AOTAutograd backends were previously defined. This refactor replaces the strangeness with something simple and straightforward. The improvements:
- There is no longer a footgun aot_autograd "backend" which doesn't actually work. No more mistyping `torch._dynamo.optimize("aot_autograd")` when you meant "aot_eager"
- Deleted aot_print because it's annoying and anyway there's no uses of it
- Instead of having BOTH the backend Subgraph and AotAutogradStrategy, there is now only an aot_autograd function which takes the kwargs to configure AOTAutograd, and then gives you a compiler function that does AOTAutograd given those kwargs. Easy.
- The primary downside is that we are now eagerly populating all of the kwargs, and that can get us into import cycle shenanigans. Some cycles I resolved directly (e.g., we now no longer manually disable the forward function before passing it to aot_autograd; aot_autograd it does it for us), but for getting inductor decompositions I had to make it take a lambda so I could lazily populate the decomps later.
New code is 130 lines shorter!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89736
Approved by: https://github.com/anjali411, https://github.com/albanD
I am not aware of any users of `FullyShardedDataParallel` that pass arguments after `process_group` positionally. I.e., I believe users pass arguments as keyword arguments. This PR formalizes this for `fully_shard()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89573
Approved by: https://github.com/mrshenli
- This PR registers the FSDP root pre-forward hook as a module forward pre-hook following the recently added support for kwargs for those hooks.
- This PR also passes `prepend=True` for the normal (not root) pre-forward hook. This is not strictly required for this PR, but I believe it is needed for composability with activation checkpointing. (We want to run FSDP logic on the outside and AC logic on the inside, just like how we recommend `FSDP(AC(module))` for the wrapper versions.)
Fun fact: I originally chose the `[FSDP()]` prefix in the PR titles when we still referred to composable FSDP as functional-like FSDP, in which case `FSDP()` approximated "functional FSDP". I am preserving this usage to make searching for PRs relating to composable FSDP easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89572
Approved by: https://github.com/mrshenli
It's kind of intractable to enable mypy everywhere at the moment,
because there are a lot of errors, and also mypy is really slow
for some reason. I just want enough types to explain the public
types for user compiler calls, going through typing the _C.dynamo
bindings along the way. This is a first step for this.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89731
Approved by: https://github.com/suo
A few things in this PR, that I found useful while debugging some
recent issues:
- We now allocate an aot_id to each aot_function/aot_module invocation,
and print it whenever we report error messages and graph output
logging. Check the comment for why this sort of thing is useful,
and also why it's different from nth_graph. This number is now
incorporated into aot_graph_name
- I noticed that nth_graph only gets incremented when backwards is
compiled. Because backwards is compiled lazily, this means that
multiple forward graphs would have gotten the same ID! I change
nth_graph to always increment to avoid confusion here.
- I added a simple describe_input function, which makes use of
num_params_buffers to tell the user if the input index they're
looking at is a param/buffer or an input. With the help of
https://github.com/pytorch/pytorch/pull/89709 we could give
even more detailed information about inputs (we could also
easily give detailed information about parameters if we stored
a mapping of index to parameter name, but I didn't need this
when debugging so I'll let someone else add it if they need
it.)
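A hedged sketch of what such a helper could look like (the actual AOTAutograd implementation may differ), relying on the fact that lifted params/buffers come before user inputs in the flat argument list:

```python
def describe_input(i: int, num_params_buffers: int) -> str:
    # Inputs are laid out as [params/buffers..., user inputs...], so the index
    # alone tells us which kind of argument we are looking at.
    if i < num_params_buffers:
        return f"parameter/buffer at flat index {i}"
    return f"user input at flat index {i - num_params_buffers}"
```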
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89710
Approved by: https://github.com/bdhirsh
It's a lot easier to debug problems in the Dynamo optimization pass if
you aren't actually triggering a multiprocessing run. Keep these tests
around.
I think the other tests can probably get this treatment too, leaving
this to future work.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89721
Approved by: https://github.com/voznesenskym
It turns out that instead of having a giant blobby aot_dispatch_autograd
function, we can factor it into a series of wrapper functions, each
of which successively guarantees more invariants on the inner
compilation function until the final inner function is quite trivial.
How exactly you have to wrap the input user functions and the output
compiled functions can be expressed concisely in Haskell, so I've
included the Haskell formulation in code comments.
This PR shows how to do this for input deduplication. Dealing with the
rest of the view handling is left to future work.
This PR should also be a slight performance improvement as deduplicating
is skipped entirely when there are no duplicate inputs.
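A hedged, standalone sketch of the wrapper idea (not the actual AOTAutograd code): the outer wrapper strips duplicate inputs before calling the inner compiler and re-expands them at runtime, so the inner compilation function can assume all inputs are distinct.

```python
def with_deduped_inputs(compile_inner):
    """Wrap a compiler so it only ever sees distinct input tensors."""
    def compile_fn(fn, args):
        first_seen = {}   # id(tensor) -> position in the deduped list
        keep = []         # indices of the first occurrences
        remap = []        # full index -> deduped index
        for i, a in enumerate(args):
            if id(a) not in first_seen:
                first_seen[id(a)] = len(keep)
                keep.append(i)
            remap.append(first_seen[id(a)])

        def fn_deduped(*deduped):
            # Rebuild the original argument list before calling the user fn.
            return fn(*[deduped[j] for j in remap])

        compiled = compile_inner(fn_deduped, [args[i] for i in keep])

        def runtime_fn(*full_args):
            # Drop the duplicates again before calling the compiled function.
            return compiled(*[full_args[i] for i in keep])

        return runtime_fn
    return compile_fn

# usage sketch: with_deduped_inputs(my_compiler)(user_fn, [t, t, u])
# the inner compiler only ever sees [t, u]
```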
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89701
Approved by: https://github.com/bdhirsh
I audited the pattern matches on the enum and it didn't
look like this one should apply there.
Sorry, no test, I know this matters on symbolic-shapes branch
but I haven't had time to extract out a minimal reproducer.
Take my word for it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89711
Approved by: https://github.com/jansel
There are various Tensors created in the backward pass which do not correspond to parameters. We don't want to mark these as gradients, but we do still want to convey as much information as possible. Thus, this PR introduces an AUTOGRAD_DETAIL category. (Which can be grouped with GRADIENT in visualization if one wishes to take a coarse grained view of the world.)
Differential Revision: [D40868661](https://our.internmc.facebook.com/intern/diff/D40868661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88926
Approved by: https://github.com/chaekit
Up until now the unit tests for category assignment have been narrowly scoped to specific checks on specific Tensors. However as we start to reach reasonable levels of category assignment it's useful to supplement those tests with higher level summary tests to inspect the larger graph and confirm that it makes sense. (It will also be necessary for some categories like activations where it is tedious to record all relevant Tensors.)
The general structure of these tests is to capture a model invocation with `__torch_dispatch__` and then cross reference those inputs and outputs with the categories assigned by the memory profiler.
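A hedged sketch of the capture step (assuming `TorchDispatchMode` from `torch.utils._python_dispatch`; this is not the actual memory-profiler test code):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class RecordOps(TorchDispatchMode):
    def __init__(self):
        super().__init__()
        self.calls = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.calls.append(func)               # remember which op ran
        return func(*args, **(kwargs or {}))  # and run it as usual

with RecordOps() as rec:
    x = torch.ones(2, 3)
    y = (x + 1).relu().sum()

# the recorded ops' inputs/outputs can then be cross-referenced with the
# categories assigned by the memory profiler
print([str(f) for f in rec.calls])
```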
Differential Revision: [D40868659](https://our.internmc.facebook.com/intern/diff/D40868659/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88653
Approved by: https://github.com/chaekit
Following the pattern of earlier PRs, we use two methods to extract parameters. The primary one is the Python tracer; both nn.Module and optim.Optimizer collect parameters and in most cases that is sufficient. As a fallback we can analyze the data flow graph and deduce likely parameters based on gradient computation and updates.
Parameter identification has a circular interaction with input identification. Inputs are defined as "not part of the core forward-backward-update loop", but we need inputs for the parameter identification fallback to give us a proxy for the forward pass. Thus, we mark parameters from the python tracer which limits which Tensors get marked as inputs. While not necessary, it adds a bit of robustness. (As shown by the strengthening of the input unit tests.)
Differential Revision: [D40238619](https://our.internmc.facebook.com/intern/diff/D40238619/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87568
Approved by: https://github.com/chaekit
It is surprisingly difficult to identify the leaves of the data flow graph. The issue is that inputs and pre-existing parameters look identical until parameter identification takes place. It's not too bad for training since Autograd lets us differentiate between them; however, I still want the tool to do something reasonable in inference.
Some of this will be ameliorated when a later PR pulls in parameters from python tracing. The current approach is passable, but I will continue to mull over refinements.
Differential Revision: [D40220388](https://our.internmc.facebook.com/intern/diff/D40220388/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87567
Approved by: https://github.com/chaekit
Semantic assignment will be built up as a series of passes which gradually pin down the regions of a trace. For this reason it is important to be very meticulous in the assignment of categories.
We begin with gradients as they are both straightforward to identify and foundational to subsequent analysis. There are two mechanisms that the profiler can use to tag gradients, each with their own advantages and limitations. The first is direct inspection of the op graph, which is generic but predicated on certain features of the Autograd engine. (And therefore not necessarily exhaustive.) The second approach is direct instrumentation via the python tracer. This method requires that gradients be attached to an nn.Module parameter and can miss corner cases such as `set_to_none=True` due to the cache structure of the python tracer. Combined, these two approaches provide very high coverage.
Temporaries are more straightforward; we can easily add them by trivial local inspection of a data flow node.
Because this is the first PR in the end-to-end section most of the code is building the scaffolding for category bookkeeping and unit testing. (The actual gradient extraction was covered in an earlier PR.)
Differential Revision: [D40220389](https://our.internmc.facebook.com/intern/diff/D40220389/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87566
Approved by: https://github.com/chaekit
The semantic meaning of a Tensor is tightly coupled to its lineage. The data flow graph allows us to identify temporary Tensors, masks, inputs, activations, and more. However one important nuance is that Tensors must be versioned; operations which mutate their inputs can also change the semantic meaning of said inputs.
It is challenging to assemble a complete picture of the data flow in a PyTorch model because ops can, and often do, recursively call into other ops. For the purpose of memory profiling this is an implementation detail, so instead we traverse the op tree to identify top level ops and allocations and then coalesce their children, folding inputs and outputs into the top level Node.
Differential Revision: [D40220391](https://our.internmc.facebook.com/intern/diff/D40220391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87006
Approved by: https://github.com/chaekit
The preexisting logic here added in
https://github.com/pytorch/functorch/pull/970 was very peculiar: if top_kwargs
was non-empty, then the inner compiled function supports kwargs. Naively, this
would lead you to expect that there is some sort of correlation between
top_kwargs and kwargs. But in fact, they're completely unrelated! top_kwargs
is the AOTAutograd configuration knobs (e.g., fw_compiler/bw_compiler), but
kwargs is the RUNTIME kwargs that are to be passed to the compiled function.
But (1) we don't support this (the function to be compiled only takes a list
of tensors) and (2) even if we did support it, conditioning on whether or not
you had passed AOTAutograd configuration kwargs to support kwargs at runtime
is bonkers.
So delete it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89664
Approved by: https://github.com/voznesenskym
There is only one call site for compiler_fn, so we can safely delay
wrapping verify correctness to here. This will help later when we
change the backend compiler calling convention to pass fake tensors
(but I need to pass real tensors here.)
This is adapted from voz's changes at https://github.com/pytorch/pytorch/pull/89392
but with fewer changes to the substantive logic. I only moved the relevant
inner implementation; there are no changes otherwise.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89662
Approved by: https://github.com/voznesenskym
A previous version of this patch graph breaks when torch.tensor fails, but that causes
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/nn/test_embedding.py -k test_embedding_bag_1D_padding_idx_cpu_float32
```
to start failing. Probably another latent bug that needs investigating.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89645
Approved by: https://github.com/albanD
This PR enables weight prepack using the MKLDNN tensor:
1. enable fake tensor mode for MKLDNN tensor input.
2. make convolution fusion kernel support MKLDNN tensor input.
3. do the weight prepack at FX fusion step.
For better performance, we always use channels_last for the CPU convolution path. Our tests show that the channels_last path gets better performance than the blocked-input path and also avoids the activation's layout conversions (plain to block and block to plain); currently only a plain-to-plain format conversion is needed.
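For context, a minimal plain-PyTorch sketch of putting a CPU convolution on the channels_last path (this is just the memory-format mechanics, not the FX fusion code itself):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
y = conv(x)
print(y.is_contiguous(memory_format=torch.channels_last))
```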
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88988
Approved by: https://github.com/jgong5, https://github.com/jansel
This is a slight regression: RAdam and Adagrad don't appear to
trace at all under fake tensors. But I think this is a more accurate
reflection of the current state of affairs.
Along the way fix some problems on the fake tensor path.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89643
Approved by: https://github.com/anjali411
I'm not really sure what desertfire's intended follow up was
on https://github.com/pytorch/pytorch/pull/87490 because when I remove
the unsupported() call, dynamo tests pass. But the change here is
conservative and I think strictly better than the current situation.
The idea is to force fake tensor propagation on for the test, and then just
observe that we are doing a graph break. Clearly, export doesn't work,
so I manually xfail it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89641
Approved by: https://github.com/anjali411
Previously, we hackily wrapped unspecialized integers into
tensors and treated them as tensor inputs. Sometimes, downstream
operations would not be able to deal with the tensor input. Now,
we wrap them into SymInt, so more correct overload selection occurs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89639
Approved by: https://github.com/anjali411
One PR towards #89205.
The content is mostly from PR #38465, with the expression slightly changed to make it faster.
Here is some benchmarking code:
```c++
#include <complex>
#include <iostream>
#include <chrono>

// main.cc
template <typename T>
inline std::complex<T> log1p_v0(const std::complex<T> &z) {
  // this PR
  T x = z.real();
  T y = z.imag();
  T theta = std::atan2(y, x + T(1));
  T r = x * (x + T(2)) + y * y;
  return {T(0.5) * std::log1p(r), theta};
}

template <typename T>
inline std::complex<T> log1p_v1(const std::complex<T> &z) {
  // PR #38465
  T x = z.real();
  T y = z.imag();
  std::complex<T> p1 = z + T(1);
  T r = std::abs(p1);
  T a = std::arg(p1);
  T rm1 = (x * x + y * y + x * T(2)) / (r + 1);
  return {std::log1p(rm1), a};
}

template <typename T>
inline std::complex<T> log1p_v2(const std::complex<T> &z) {
  // naive, but numerically inaccurate
  return std::log(T(1) + z);
}

int main() {
  int n = 1000000;
  std::complex<float> res(0.0, 0.0);
  std::complex<float> input(0.5, 2.0);

  auto start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v0(input);
  }
  auto end = std::chrono::system_clock::now();
  auto elapsed = end - start;
  std::cout << "time for v0: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v1(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v1: " << elapsed.count() << '\n';

  start = std::chrono::system_clock::now();
  for (int i = 0; i < n; i++) {
    res += log1p_v2(input);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "time for v2: " << elapsed.count() << '\n';

  std::cout << res << '\n';
}
```
Compiling with `g++ main.cc` and running the resulting binary produces the following results:
```
time for v0: 237812271
time for v1: 414524941
time for v2: 360585994
```
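As a quick numerical sanity check (not part of the benchmark above), the v0 formulation can also be exercised from plain Python, where the naive `log(1 + z)` loses precision for very small `|z|`:

```python
import cmath
import math

z = 1e-12 + 1e-12j
naive = cmath.log(1 + z)  # forming 1 + z first discards low-order bits of z
x, y = z.real, z.imag
accurate = complex(0.5 * math.log1p(x * (x + 2) + y * y), math.atan2(y, x + 1))
print(naive)
print(accurate)
```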
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89214
Approved by: https://github.com/lezcano
Summary:
After counters are reset, getters' behaviors are inconsistent. To improve that, here I 1) move the validation of CounterData into CounterData::IsValid such that it's better encapsulated, 2) divide getters into two groups: a) MetricsArena::GetCounter() and b) MetricsArena::ForEachCounter(), and route MetricsArena::GetCounterNames() and CreateMetricReport() to use b.
This is paired with pytorch/xla#4217.
Test Plan:
PJRT_DEVICE=CPU python xla/test/test_metrics.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89608
Approved by: https://github.com/JackCaoG
Summary:
This PR deprecates the `compute_dtype` field on observers, and replaces
it with the `is_dynamic` field on observers. This is better aligned
with the reference model spec.
Test plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85431
Approved by: https://github.com/jerryzh168
E.g. `test_cpp_extensions_aot_ninja` fails as it includes `vec.h`, which requires the vec/vsx/* headers and `sleef.h`. The latter is also required for AVX512 builds on non-MSVC compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85547
Approved by: https://github.com/kit1980
This PR adds tests to verify the number of outputs returned by an XLA graph. The understanding from this PR will help us fix https://github.com/pytorch/torchdynamo/issues/1908 and eventually enable training for the dynamo/torchxla integration. I'm sending this PR separately so Jack can help verify that the behavior is expected and play with it.
Here are some code snippets whose behavior is not straightforward at first glance:
```
def forward(self, a, b, c):
    """
    The XLA graph will only return the first 2 items
    """
    return a + b, a + c, b
```
```
def forward(self, a, b, c):
    """
    Inplace update on b cause it to be returned in XLA graph
    """
    b.zero_()
    return a + b, a + c, b
```
```
def forward(self, a, b, c):
    """
    Even if we return b twice, the XLA graph only return b once.
    """
    b.zero_()
    return a + b, a + c, b, b
```
Here is what the added tests observe:
1. XLA does not return outputs that are also inputs -- as long as the tensor is not inplace-updated. At first glance one may wonder why we should consider this kind of 'non-realistic' corner case, but such graphs do show up in AOTAutograd. The main reason is that AOTAutograd lifts all model parameters/buffers as graph inputs and may return some of them. Check ***test_direct_return***
2. If a tensor is inplace-updated, XLA will still return it as a graph output even if it's also an input. The only difference compared to item 1 is that the inplace update on the tensor causes it to be returned. This happens for BatchNorm2d since the running_mean/variance tensors are inplace-updated during training. Check ***test_direct_return_with_inplace_update***
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89536
Approved by: https://github.com/jansel
Set `cmake.dir` to `/usr/local` in `.circleci/scripts/build_android_gradle.sh`
Prep change for raising the compiler standard to C++17: cmake-3.18 is the first version to support the C++17 language standard for CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89570
Approved by: https://github.com/atalman
Using the same repro from the issue (but with BatchNorm2D)
Rectifies native_batch_norm schema by splitting the schema into 2:
1. one will have NON-optional alias-able running_mean and running_var inputs
2. the other will just not have those parameters at all (no_stats variation)
**Calling for name suggestions!**
## test plan
I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit`
CI should pass.
## next steps
Because of bc/fc reasons, we reroute native_batch_norm to call our new schemas ONLY through the python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697
Approved by: https://github.com/albanD
`JIT_LOG` checks whether logging was enabled for that particular file, and when it isn't, it doesn't output anything. Since the test checks the size of `test_stream`, it fails. Forcing the file to have logging enabled just to see whether the stream is being set correctly during the test makes no sense, so this patch simply forces the output and checks whether it worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82722
Approved by: https://github.com/davidberard98
I have found the reason why uploading tests stats fails for rerun disabled workflow, for example https://github.com/pytorch/pytorch/actions/runs/3522896778/jobs/5917765699. The problem is that the pytest XML file is now too big to be processed quickly (x50 bigger). Unlike unittest, `pytest-flakefinder` used by rerun disabled tests for test_ops includes skipped messages multiple times (50 times by default, retrying and skipping). This slows down the upload test stats script too much (O(n)) because it tries to gather all the stats. On the other hand, `check_disabled_tests` doesn't suffer from the same issue because it ignores all these skipped messages.
This is a quick fix to skip test reports from rerun disabled tests workflow when trying to upload test stats.
I'll try to fix this properly later in the way we use pytest-flakefinder. From what I see, a zipped test report from rerun disabled tests is only a few MB ([example](https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3521687954/1/artifact/test-reports-test-default-1-2-linux.2xlarge_9636028803.zip)), but it balloons into a much bigger XML file after extraction, from a dozen to a few hundred MB of text. The size of the zipped file is not a big immediate problem.
### Testing
[3521687954](https://github.com/pytorch/pytorch/actions/runs/3521687954) is an example workflow with rerun disabled tests and mem leak check. The script can now finish when running locally:
* `upload_test_stats` finishes around 3+ minutes
```
time python -m tools.stats.upload_test_stats --workflow-run-id 3521687954 --workflow-run-attempt 1 --head-branch master
...
Writing 8925 documents to S3
Done!
Writing 1760 documents to S3
Done!
Writing 1675249 documents to S3
Done!
python3 -m tools.stats.upload_test_stats --workflow-run-id 3521687954 1 185.69s user 12.89s system 75% cpu 4:22.82 total
```
* `check_disabled_tests` finishes within 3 minutes
```
time python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954 --workflow-run-attempt 1 --repo pytorch/pytorch
...
python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954 1 154.19s user 4.17s system 97% cpu 2:42.50 total
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89548
Approved by: https://github.com/clee2000
Fixes#88985
By default, `maybe_wrap_dim` allows through `dim=0` or `dim=-1`
for scalar tensors which leads to an invalid dimension being used to
index into `tensor.sizes()` as in the code sample from the issue.
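To see why this matters, a scalar (0-d) tensor has an empty `sizes()`/`shape`, so any "wrapped" dimension is out of range when used as an index:

```python
import torch

t = torch.tensor(3.0)   # 0-d (scalar) tensor
print(t.dim())          # 0
print(t.shape)          # torch.Size([])
# Indexing the empty shape with a wrapped dim such as 0 is exactly the kind
# of out-of-range access described above:
# t.shape[0]            # IndexError if uncommented
```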
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89234
Approved by: https://github.com/mruberry
Relands #89031
Per title. We now set strides from the fx graph only for convolutions and mm, which is a hack, but bmm in some cases caused an extra copy and there is no obvious way to fix that; we should rethink the strides anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
Summary:
Goal
Add `all_reduce` collective to multi-threaded ProcessGroup added in D40236769 (6663ae5537).
Code Motion
Added `allreduce` collective to ProcessLocalGroup (a subclass of c10d ProcessGroup).
What's Next
Add a DDP test utilizing the new allreduce op.
Generalize `allreduce` to allow other `ReduceOp`s besides `SUM`.
Test Plan:
cd fbcode/caffe2
buck2 test mode/dev //caffe2/test/distributed:multi_threaded
Differential Revision: D41046606
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89043
Approved by: https://github.com/wanchaol
Currently it falls through to a call to `storage()`, which the IPU doesn't support.
I've made the minimal change here for ease of merging (this'd help us if it was in for 1.13.1), however...
**QUESTION**: Is there any reason why `not torch._C._has_storage(self)` needs to *also* be guarded on `self.device.type == privateuseone`? In other words, could the condition for using `clone` not simply be this?
```python
self.is_sparse
or self.device.type in ["lazy", "xla", "mps", "ort", "meta", "hpu", "ipu"]
or not torch._C._has_storage(self)
or (type(self) is not Tensor and self.data_ptr() == 0)
```
If the condition fails, the very next thing is a call to `self._typed_storage()` which will fail, so it feels to me like *any* case without storage shouldn't fall through to the `storage()` call.
The original PR for adding the 'no storage and device is `PrivateUse1`' condition ([86557](https://github.com/pytorch/pytorch/pull/86557)) doesn't discuss whether this could be broadened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89129
Approved by: https://github.com/albanD
This is a group of bug fixes for [7k github models](https://github.com/pytorch/torchdynamo/issues/1884); it would fix 30+ model tests.
* Support ```tensor.type()```.
* Support ```tensor.get_device()```.
* Support ```torch.nn.functional._Reduction.get_enum```.
* Support ```torch._utils._get_device_index()```.
* Fallback ```tensor.data_ptr()```.
* ```FakeTensor``` always returns 0
* When fake tensor propagation is off, we ```clone``` the input tensor, so it makes no sense to track the original ```data_ptr```. And I don't think this is a very popular API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89486
Approved by: https://github.com/jansel
This reverts commit 9fd00f194ae4e28948a9a03a6382c20dde04e4fd.
Reverted https://github.com/pytorch/pytorch/pull/89174 on behalf of https://github.com/robieta due to For some reason this is interacting badly with NVFuser. I think it is instability in kineto, but until we figure out what's going on reverting is a necessary evil.
Fixes#43144
This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.
For example, we could have the following allocator in c++
```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {

void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
  void* ptr;
  std::cout << "alloc " << size << std::endl;
  cudaMalloc(&ptr, size);
  return ptr;
}

void my_free(void* ptr) {
  std::cout << "free " << std::endl;
  cudaFree(ptr);
}

}
```
Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```
And use it from PyTorch as follows
```python
import torch
# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```
Things to discuss
- How to test this, needs compiling external code ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
This PR fixes convolution when using `torchdynamo` with dynamic shapes.
**Problem:** there are some `tensor.sizes()` calls in a few error messages. As a result, an uninformative error message was being displayed.
```python
@torch._dynamo.optimize("eager")
def foo(inp, w):
return F.conv2d(inp, w)
inp = torch.rand((1, 1, 32, 32))
w = torch.rand((1, 2, 3, 3))
# |
# |--------- incorrect shape!
foo(inp, w)
```
-----
**Before this PR:**
```python
Traceback (most recent call last):
File "torch/_dynamo/utils.py", line 1076, in run_node
return node.target(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__
op_impl_out = op_impl(self, func, *args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 445, in conv
conv_backend = torch._C._select_conv_backend(**kwargs)
RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
```
**After this PR:**
```python
Traceback (most recent call last):
File "torch/_dynamo/utils.py", line 1076, in run_node
return node.target(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__
op_impl_out = op_impl(self, func, *args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 445, in conv
conv_backend = torch._C._select_conv_backend(**kwargs)
RuntimeError: Given groups=1, weight of size [1, s1, s2, s2], expected input[1, 1, s0, s0] to have s1 channels, but got 1 channels instead
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89549
Approved by: https://github.com/ezyang
Summary:
As titled: after this PR we can produce quantize_per_channel and dequantize_per_channel ops (typically used for quantizing weights)
in the reference flow using decomposed tensor
Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_per_channel_quant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89270
Approved by: https://github.com/vkuzo
When FBGEMM is not used (either manually disabled or on platforms such as POWER where it isn't supported at all) the fallback code requests a `data_ptr<float>` on a `Tensor` object returned by `to(ScalarType::Float)` in the same line. This object will be destroyed at the end of the line leading to a dangling pointer.
On some platforms this manifests in wrong results being returned as the memory gets overwritten. On other platforms anything may happen due to this being undefined behavior, although most likely it will just crash or continue to return semi-random results which may even happen to be correct (when the memory is not reused yet)
Fix this by binding the temporary object (or initial object) to a const value reference which extents its lifetime and getting the `data_ptr` from that.
Fixes#84748
This bug was introduced by a seemingly unrelated change in #64081 hence ccing @d1jang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84750
Approved by: https://github.com/kimishpatel
The idea is to add a custom handler to the Functionalize key in the Python
dispatcher that runs the functionalized version alongside a non-functionalized
version, and checks that their outputs agree in the
end. (Technically, for metadata mutation we should also check the
inputs, but for now we're relying on those functions returning self.)
I turned this on for test_functionalize.py (new TestCrossRefFunctionalize)
and found a bunch of failures that look legit.
This probably doesn't interact that nicely if you're also tracing at the same time; we probably need more special logic for that (the direct fix is just disabling tracing when we create the nested fake tensor mode, but IDK if there's a more principled way to organize this).
There are some misc fixups which I can split if people really want.
- xfail_inherited_tests moved to test common_utils
- Bindings for _dispatch_tls_set_dispatch_key_included,
_dispatch_tls_is_dispatch_key_included and _functionalization_reapply_views_tls
- Type stubs for _enable_functionalization, _disable_functionalization
- all_known_overloads utility to let you iterate over all OpOverloads
in all namespaces. Iterator support on all torch._ops objects to let
you iterate over their members.
- suspend_functionalization lets you temporarily disable functionalization mode
in a context
- check_metadata_matches for easily comparing outputs of functions and see
if they match (TODO: there are a few copies of this logic, consolidate!)
- _fmt for easily printing the metadata of a tensor without its data
- _uncache_dispatch for removing a particular dispatch key from the cache,
so that we force it to regenerate
- check_significant_strides new kwarg only_cuda to let you also do stride
test even when inputs are not CUDA
- Functionalize in torch._C.DispatchKey
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89498
Approved by: https://github.com/malfet
Switch the GCC/Clang max versions to be exclusive, as `include/crt/host_config.h` checks only the major version for the upper bound. This allows being less restrictive and matches the checks in the aforementioned header.
Also update the versions using that header in the CUDA SDKs.
Follow up to #82860
I noticed this as PyTorch 1.12.1 with CUDA 11.3.1 and GCC 10.3 was failing in the `test_cpp_extensions*` tests.
Example for CUDA 11.3.1 from the SDK header:
```
#if __GNUC__ > 11
// Error out
...
#if (__clang_major__ >= 12) || (__clang_major__ < 3) || ((__clang_major__ == 3) && (__clang_minor__ < 3))
// Error out
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86360
Approved by: https://github.com/ezyang
closes#35643
This PR is mostly borrowed from #82042. Thanks @Padarn for implementing the first version and debugging the errors.
Based on the discussion in #82042, this PR adds a with_kwargs argument to the register_forward_pre_hook and register_forward_hook methods. When the arg is set to true, the provided hook must accept kwargs. Under the hood, this PR adds `_forward_pre_hooks_with_kwargs` and `_forward_hooks_with_kwargs` sets to keep track of which hooks accept kwargs.
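A minimal usage sketch, assuming the final API matches the description above (hook returns the possibly modified `(args, kwargs)`):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def forward(self, x, factor=1.0):
        return x * factor

def pre_hook(module, args, kwargs):
    # With with_kwargs=True the hook also receives (and may rewrite) the kwargs.
    kwargs["factor"] = 2.0
    return args, kwargs

m = Scale()
m.register_forward_pre_hook(pre_hook, with_kwargs=True)
print(m(torch.ones(2)))  # tensor([2., 2.])
```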
Differential Revision: [D41431111](https://our.internmc.facebook.com/intern/diff/D41431111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89389
Approved by: https://github.com/soulitzer
- Avoid fx graph rewrite that replaces certain ops with ones using
triton random
- Keep track of replacement ops using triton random, so it is possible
to not disable all replacements when using fallback_random
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89515
Approved by: https://github.com/ngimel
Replace the remaining hand-written code in vec256_float_vsx.h by calls to Sleef functions similar to what was done in #59382 & #82646 after #41541
This fixes wrong results for e.g. `sin(1e20)`.
Fixes#85978
To fix#85978 I only needed to do the sin/cos functions to make the test pass but to not encounter the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. In the diff I've noticed the faulty whitespace so to make this complete I fixed that too, so it should now be done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86453
Approved by: https://github.com/malfet
The test may fail due to slightly different values caused by a different ordering of the matrices in SGEMM:
> Mismatched elements: 1 / 50 (2.0%)
> Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed)
> Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed)
Observed on POWER (ppc64le)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86365
Approved by: https://github.com/mruberry, https://github.com/kit1980
Summary:
Split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` into a separate function and added support for dynamic quantization in the decomposed version of this function.
In case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode:
```
x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear
```
Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89248
Approved by: https://github.com/vkuzo
This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.
This is used when flattening nested dicts and flattening sharded tensors.
Docstring and comments will be added in the following PRs.
Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89398
Approved by: https://github.com/wanchaol
Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from changing it was super small anyways.
Differential Revision: D41450088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89463
Approved by: https://github.com/hlu1
In #87741 we added inference support for the dynamo/torchxla integration. Later on, in #88449, we attempted to add training support. That attempt was not smooth because
- we tried 2 things together:
 1. let dynamo trace the model on xla rather than eager
 2. enable training
- it turns out neither of these two tasks is trivial.
Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd, and AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to xla devices, which hurts performance a lot. Having a cache that maps each eager parameter to an XLA parameter does not solve the problem, since an update to either one will not sync automatically to the other; they will easily go out of sync.
This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step for enabling training.
Also, tracing on XLA makes the data movement more efficient. We see a 1.50x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model | XLA (trace once) | XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18 | 1.38 | 1.008 |
+-------------------------+--------------------+-------------------------+
| resnet50 | 1.227 | 0.998 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d | 1.544 | 1.008 |
+-------------------------+--------------------+-------------------------+
| alexnet | 1.085 | 1.045 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2 | 2.028 | 1.013 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0 | 1.516 | 0.995 |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1 | 0.868 | 1.01 |
+-------------------------+--------------------+-------------------------+
| vgg16 | 1.099 | 1.008 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch | 3.26 | 1.027 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 2.182 | 1.015 |
+-------------------------+--------------------+-------------------------+
| geomean | 1.50389 | 1.01261 |
+-------------------------+--------------------+-------------------------+
```
Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.
This removes duplicated shards in a list of SavePlans. It is used when saving a DT with replicated placement.
Docstring and comments will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89399
Approved by: https://github.com/wanchaol
When looking into the Rockset data for disabled unittest tests, for example `testAdd`, I see that it's re-run only 3 times instead of the expected 50+ times under rerun-disabled-tests mode
```
[
{
"name": "testAdd",
"classname": "TestLazyReuseIr",
"filename": "lazy/test_reuse_ir.py",
"flaky": false,
"num_green": 3,
"num_red": 0
}
]
```
It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in https://github.com/pytorch/pytorch/pull/88646. The retrying logic for successful tests under rerun-disabled-tests mode is never executed because num_retries_left would be equal to MAX_NUM_RETRIES (not smaller) if the very first run succeeds. Thus, the sample test `testAdd` finishes right away (1 success count).
* `report_only` and `RERUN_DISABLED_TESTS` are 2 different things and shouldn't be mixed together. RERUN_DISABLED_TESTS has the higher priority.
* We also don't want to retry skipped tests under rerun-disabled-tests mode because they are only skipped due to `check_if_enable` check `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run`
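A hedged sketch of the separation described above (names are illustrative; the real check lives in the common test-retry wrapper):

```python
MAX_NUM_RETRIES = 50

def should_retry_successful_test(rerun_disabled_tests, report_only, num_retries_left):
    if rerun_disabled_tests:
        # Rerun-disabled-tests mode wants 50+ samples, so keep retrying even
        # when the very first run succeeds (num_retries_left == MAX_NUM_RETRIES).
        return num_retries_left > 0
    # report_only keeps the old behavior and is no longer mixed with the above.
    return report_only and num_retries_left < MAX_NUM_RETRIES
```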
### Testing
* CI https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly
* Locally
```
# export CI=1
# export PYTORCH_RETRY_TEST_CASES=1
# export PYTORCH_OVERRIDE_FLAKY_SIGNAL=1
# export PYTORCH_TEST_RERUN_DISABLED_TESTS=1
$ python test/run_test.py --verbose -i lazy/test_reuse_ir
Ignoring disabled issues: []
Selected tests:
lazy/test_reuse_ir
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['lazy/test_reuse_ir']
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json
parallel (file granularity) tests:
lazy/test_reuse_ir
serial (file granularity) tests:
Ignoring disabled issues: []
Ignoring disabled issues: []
Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877]
Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279]
Expand the folded group to see the log file of lazy/test_reuse_ir
##[group]PRINTING LOG FILE of lazy/test_reuse_ir (/Users/huydo/Storage/mine/pytorch/test/test-reports/lazy-test_reuse_ir_6cf_dxa1)
Running tests...
----------------------------------------------------------------------
Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir
testAdd (__main__.TestLazyReuseIr) ... ok (1.215s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 50
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 49
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 48
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 47
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 46
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 45
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 44
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 43
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 42
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 41
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 40
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 39
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 38
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 37
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 36
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 35
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 34
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 33
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 32
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 31
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 30
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 29
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 28
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 27
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 26
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 25
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 24
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 23
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 22
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 21
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 20
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 19
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 18
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 17
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 16
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 15
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 14
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 13
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 12
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 11
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 10
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 9
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 8
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 7
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 6
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 5
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 4
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 3
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 2
ok (0.001s)
testAdd (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 1
ok (0.001s)
testAddSub (__main__.TestLazyReuseIr) ... testAdd succeeded - num_retries_left: 0
skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
testAddSubFallback (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
----------------------------------------------------------------------
Ran 54 tests in 1.264s
OK (skipped=3)
```
Here is the sample rockset query
```
WITH added_row_number AS (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number
FROM
commons.rerun_disabled_tests
)
SELECT
name,
classname,
filename,
flaky,
num_green,
num_red
FROM
added_row_number
WHERE
row_number = 1
AND name = 'testAdd'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89454
Approved by: https://github.com/clee2000
Handling constant data for xnnpack delegation. This allows us to handle new modules such as:
```
class Module(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._constant = torch.ones(4, 4, 4)

    def forward(self, x):
        return x + self._constant
```
This is the precursor work to handling convolution, as we need to serialize constant data (weights).
Differential Revision: [D41050349](https://our.internmc.facebook.com/intern/diff/D41050349/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89445
Approved by: https://github.com/digantdesai
This has been flaky on macOS for a while ([hud](https://hud.pytorch.org/failure/RuntimeError%3A%20test_ops_fwd_gradients%20failed)) and I can reproduce this locally. The issue was raised by https://github.com/pytorch/pytorch/issues/66033 and it seems to point to macOS itself https://github.com/graphia-app/graphia/issues/33. So this switches to a single thread when running `test_ops_fwd_gradients` on macOS as a mitigation for the flaky tests.
### Testing
`pytest test_ops_fwd_gradients.py -k test_fn_fwgrad_bwgrad -vv --flake-finder` to run all `test_fn_fwgrad_bwgrad` tests 50 times to make sure they all pass (no flaky anymore)
https://hud.pytorch.org/tests shows that `test_ops_fwd_gradients` on macOS takes about 15 minutes to finish, or 8 minutes when using 2 shards like in the test. There is no obvious difference in the test duration:
```
2022-11-21T21:34:18.6078080Z Running test_ops_fwd_gradients ... [2022-11-21 21:34:18.600663]
2022-11-21T21:34:21.6805770Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.680156]
2022-11-21T21:34:21.6806380Z Ignoring disabled issues: []
2022-11-21T21:34:21.6815250Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.681174]
2022-11-21T21:34:21.6815830Z Ignoring disabled issues: []
.....
2022-11-21T21:40:42.2422700Z =============================== warnings summary ===============================
.....
2022-11-21T21:40:42.2424670Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-47b619449ea7db1f.xml -
2022-11-21T21:40:42.2424850Z = 831 passed, 596 skipped, 5 deselected, 17 xfailed, 1 warning in 374.54s (0:06:14) =
.....
2022-11-21T21:42:00.1923310Z =============================== warnings summary ===============================
.....
2022-11-21T21:42:00.1925370Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-d24ee6419a602a6e.xml -
2022-11-21T21:42:00.1925540Z = 828 passed, 603 skipped, 7 deselected, 20 xfailed, 1 warning in 452.94s (0:07:32) =
....
2022-11-21T21:42:09.9035670Z FINISHED PRINTING LOG FILE of test_ops_fwd_gradients (/Users/runner/work/pytorch/pytorch/test/test-reports/test_ops_fwd_gradients_ha_3rfhb)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89410
Approved by: https://github.com/soulitzer
When we create fake tensors, we may call operators that introduce
guards, to accurately reconstruct views. But these guards are spurious:
if a user is able to present a tensor that "looks the same", they have
implicitly fulfilled the contract that the view is creatable.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89349
Approved by: https://github.com/voznesenskym
Fix bugs in [7k github models](https://github.com/pytorch/torchdynamo/issues/1884).
* Legacy code still uses ```tensor.data```; I think we can rewrite it with ```tensor.detach```, though I'm not sure if there is anything I didn't anticipate.
* Support ```tensor.layout```.
The root cause of these issues is that dynamo wraps an unimplemented ```tensor.x``` call into ```GetAttrVariable(TensorVariable, x)```, but this op is not inserted into the FX graph. Hence, during fake tensor propagation, it throws ```KeyError: 'example_value'```.
Dynamo should support these two popular attributes anyway. However, whether dynamo should support ___all___ ```tensor.x``` calls rather than falling back to ```GetAttrVariable``` is, I think, debatable.
If I turn off fake tensor propagation, it works well even not including this fix. So I'm curious if we should improve the fake propagation to cover similar cases. cc @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @jansel @eellison
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 404, in _compile
out_code = transform_code_object(code, transform)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/bytecode_transformation.py", line 341, in transform_code_object
transformations(instructions, code_options)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 392, in transform
tracer.run()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 1523, in run
super().run()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 389, in run
and self.step()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 359, in step
getattr(self, inst.opname)(inst)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 193, in wrapper
return inner_fn(self, inst)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 865, in CALL_FUNCTION_KW
self.call_function(fn, args, kwargs)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 301, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/torch.py", line 407, in call_function
tensor_variable = wrap_fx_proxy(
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 636, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 676, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1024, in get_fake_value
args, kwargs = torch.fx.node.map_arg((node.args, node.kwargs), visit)
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in map_arg
return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in map_aggregate
t = tuple(map_aggregate(elem, fn) for elem in a)
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in <genexpr>
t = tuple(map_aggregate(elem, fn) for elem in a)
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in map_aggregate
return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items())
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in <genexpr>
return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items())
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 631, in map_aggregate
return fn(a)
File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in <lambda>
return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1022, in visit
return n.meta["example_value"]
KeyError: 'example_value\n\nfrom user code:\n File "./generated/test_BayesWatch_pytorch_prunes.py", line 108, in forward\n return torch.zeros([x.size()[0], self.channels, x.size()[2] // self.spatial, x.size()[3] // self.spatial], dtype=x.dtype, layout=x.layout, device=x.device)\n\nSet torch._dynamo.config.verbose=True for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n torch._dynamo.config.suppress_errors = True\n'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89257
Approved by: https://github.com/jansel
1. `aten.div.Tensor_mode` should allow broadcasting
2. `div` can use `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT`
3. `prims.div` on integers should be truncating division
4. Add lowering for `true_divide` which is aliased to `div`
5. register lowering for inplace version of `div_mode`
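For reference, the eager-mode semantics these lowerings are matching can be seen directly with plain PyTorch calls (not inductor code):

```python
import torch

a = torch.tensor([7, -7])
b = torch.tensor([2, 2])
print(torch.div(a, b))                         # true division promotes ints to float: tensor([ 3.5000, -3.5000])
print(torch.div(a, b, rounding_mode="trunc"))  # truncating division: tensor([ 3, -3])
print(torch.true_divide(a, b))                 # alias of div, same as the first result
```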
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88603
Approved by: https://github.com/ngimel
Summary:
This Diff ports the torchbench.py script from torchdynamo to torchbench to support the development of internal models.
Currently, it only works with the `--only` option and can only test one model at a time.
Note that the noisy logs are from upstream model code, not the benchmark code.
In the internal environment, `torch._dynamo.config.base_dir` is not writable, so we add an option to specify the output directory.
Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only ads_dhen_5x --part over --output-directory /tmp/tb-test/
cuda eval ads_dhen_5x
1/ 1 +0 frames 2s 1 graphs 1 graph calls 412/ 411 = 100% ops 100% time
```
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only cmf_10x --part over --output-directory /tmp/tb-test/
cuda eval cmf_10x
1/ 1 +0 frames 1s 1 graphs 1 graph calls 306/ 305 = 100% ops 100% time
```
Reviewed By: jansel
Differential Revision: D41294311
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89239
Approved by: https://github.com/jansel
`config.compile_threads` gets the number of compile threads via `min(32,os.cpu_count())`, but `os.cpu_count()` is the total number of cpu cores in the system, not the available ones. This causes compile thread contention when the available cpu cores are fewer than `min(32,os.cpu_count())`, e.g., when the available cores are limited with numactl or taskset, making the compilation very slow. This PR uses `len(os.sched_getaffinity(0))`, which returns the number of available cpu cores, when `os.sched_getaffinity` is available.
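A small sketch of the described fallback (`os.sched_getaffinity` exists on Linux; elsewhere we fall back to the total core count):

```python
import os

def available_cpu_count():
    # Respect CPU affinity (numactl/taskset) when the platform exposes it,
    # otherwise fall back to the total number of cores.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count()

compile_threads = min(32, available_cpu_count())
print(compile_threads)
```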
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377
Approved by: https://github.com/soumith
**Summary**
The update includes API changes and optimizations to reduce framework overhead, which will benefit all mkldnn (onednn) ops in JIT mode, the inductor CPU backend, etc. These benefits will be seen after switching to the new ideep API in future PRs.
**Test plan**
For correctness, all UTs that call mkldnn ops, including test_ops.py, test_mkldnn*.py, test_quantization.py, etc.
For performance, TorchBench has been run and no regression was found. Test configuration:
- Intel(R) Xeon(R) IceLake with 40 cores
- Multi-instance runs
- tcmalloc & Intel OMP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87966
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.
Originally, `None` meant disabling vectorization, while a specific value meant the number of elements to be vectorized at a time. But the proper value depends on the data type: regarding a 256-bit SVE/SIMD ISA for ARM and X86, `simdlen` should be 16 for Float but 32 for BFloat. Hence, this PR redefines `simdlen` as the SIMD bit width. The detailed semantics are as follows (a small configuration sketch follows the list).
- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It falls back to the disabled semantics if the bit width does not match the ISA width.
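A small configuration sketch of the three cases (values are illustrative; `config.cpp.simdlen` is the knob discussed above, assumed to be reachable via `torch._inductor.config`):
```python
import torch._inductor.config as inductor_config

inductor_config.cpp.simdlen = None  # auto-detect the ISA (e.g. prefer AVX512 over AVX2 on x86)
inductor_config.cpp.simdlen = 1     # <= 1: explicitly disable vectorization
inductor_config.cpp.simdlen = 256   # > 1: request a specific SIMD bit width
```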
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips are removed, because our decomposition wasn't accurate enough for fp64, but the libdevice version is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
This ensures that subsequent link commands involving mkl libraries
know where to find the libraries if they are in a non-standard
location (which is the case if you installed mkl via conda, which
is what our standard instructions recommend.)
This is kind of a hack, because the MKL libraries are not actually
guaranteed to be in $MKL_ROOT/lib (they are for the conda install
though). The real fix is to properly use the MKL targets from
FindMKL.cmake, but that's its own can of worms. See
https://github.com/pytorch/pytorch/issues/73008
This fixes https://github.com/pytorch/audio/issues/2784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89359
Approved by: https://github.com/soumith
The `setType` API is not respected by the current exporter because the graph-level shape type inference simply overrides every non-ONNX-op shape we got from node-level shape type inference. To address this issue, this PR (1) makes custom Ops with `setType` **reliable** in ConstantValueMap to secure their shape/type information in the pass _C._jit_pass_onnx; (2) if an Op that is not a valid ONNX node has shape/type information in the graph-level pass _C._jit_pass_onnx_graph_shape_type_inference, we recognize it as reliable.
1. In #62856, the refactor in onnx.cpp caused a regression on custom Ops, as that was the step where we should update custom Op shape/type information into ConstantValueMap for the remaining Ops.
2. Add another condition besides IsValidONNXNode for custom Op setType in shape_type_inference.cpp. If all of a node's outputs have shapes (not all dynamic), we treat the type as custom-set.
3. ~However, this PR won't solve the [issue](https://github.com/pytorch/pytorch/issues/87738#issuecomment-1292831219) that, in node-level shape type inference, the exporter emits a warning about the unknown custom Op, since we process its symbolic_fn after this warning, even though it would have shape/type if setType were used correctly. That will be left for another issue to solve. #84661~ Add `no_type_warning` in UpdateReliable() so it only warns if a non-ONNX node with no given type appears.
Fixes #81693, fixes #87738
NOTE: I am not fully confident this doesn't break anything. Please share your thoughts if you have a robust test in mind.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88622
Approved by: https://github.com/BowenBao
Summary:
This is needed for choose qparams, but previously it was not configurable; in the reference quantization flow
with decomposed Tensors, we are making this explicit.
Test Plan:
tested in future PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89267
Approved by: https://github.com/vkuzo
Make mutation faster to speed up tracing optimizers, helps with https://github.com/pytorch/torchdynamo/issues/1803
`replace_all` no longer iterates over the entire variable tracker data structure every time a mutation is performed
Each variable tracker internally keeps a set of contained mutable variable trackers, to provide a hint to `replace_all`. This is populated with a call to `apply` from `__post_init__` in the base `VariableTracker`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89170
Approved by: https://github.com/jansel
Summary:
Change the error message from
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`
to
`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`
Test Plan: sandcastle
Differential Revision: D41405238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
Summary:
We want to introduce an experimental control flow op: map() to export some models as FX graphs correctly.
Some clarification on the basic requirements we have in mind:
1. This op can nest cond() and other control flow primitives internally.
2. We don't necessarily need loop carried dependencies for the models we've seen.
3. This map() op can handle dynamically shaped tensor as input and return dynamically shaped output based on input shapes.
4. We should be able to pass through additional arguments to the loop body as extra arguments.
In this diff we introduce a new control flow op `map()` which has the following semantics:
```
def map(f: Callable, xs: Tensor, *args):
    # one possible implementation:
    return torch.stack([f(x, *args) for x in xs])
```
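A hypothetical usage sketch, assuming the op is exposed next to `cond()` in `functorch.experimental.control_flow` (as the test plan below suggests); note that importing `map` shadows the Python builtin within this snippet:
```python
import torch
from functorch.experimental.control_flow import map

def body(x, bias):
    return torch.relu(x) + bias

xs = torch.randn(4, 3)
bias = torch.randn(3)
out = map(body, xs, bias)  # semantically torch.stack([body(x, bias) for x in xs])
print(out.shape)           # torch.Size([4, 3])
```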
Test Plan:
pytest functorch/test_control_flow.py
CI
Differential Revision: D41165796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88767
Approved by: https://github.com/zou3519
Summary: Some nodes lost the type annotation during `split_module`, causing the submodels to be un-scriptable. This is because compiler always infer Tensor type, which is wrong for non-Tensor types. We attempt to infer type annotation for `getitem` node to improve scriptability.
Test Plan:
```
buck2 test //caffe2/test:fx_experimental
```
Differential Revision: D41037819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88510
Approved by: https://github.com/xush6528
Summary:
I saw the following issue only on Windows build in PR #88767:
```
RuntimeError: AttributeError: 'SymNode' object has no attribute 'torch::impl::PythonSymNodeImpl::ge'
```
It's only on Windows because we get the attributes of SymNode in C++ with
`__FUNCTION__` macro, which is not in C++ standard, therefore has platform specific behavior.
In this case, MSVC will include a function's namespace and class name, which is not intended here.
Instead we should use `__func__`. see: https://en.cppreference.com/w/cpp/language/function#Function_definition
godbolt example to show the difference: https://godbolt.org/z/PGfvecxPx
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89264
Approved by: https://github.com/ezyang
Summary: The `-weak_framework` flag is no longer necessary; Buck will weakly link frameworks depending on the `target_sdk_version` of the binary being linked.
Test Plan:
Compare IG load commands before and after change with P553208168
```
load command difference in Instagram.app/Frameworks/InstagramXplatFramework.framework/InstagramXplatFramework
--- /tmp/tmpvd97s2v0 2022-11-16 12:13:54.082910598 -0800
+++ /tmp/tmpj20r_4ca 2022-11-16 12:13:54.082910598 -0800
@@ -9,7 +9,7 @@
/System/Library/Frameworks/CoreHaptics.framework/CoreHaptics (compatibility version 1.0.0, current version 1.0.0, weak)
/System/Library/Frameworks/CoreImage.framework/CoreImage (compatibility version 1.0.0, current version 5.0.0)
/System/Library/Frameworks/CoreLocation.framework/CoreLocation (compatibility version 1.0.0, current version 2780.0.17)
- /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0, weak)
+ /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0)
/System/Library/Frameworks/CoreMedia.framework/CoreMedia (compatibility version 1.0.0, current version 1.0.0)
/System/Library/Frameworks/CoreServices.framework/CoreServices (compatibility version 1.0.0, current version 1226.0.0)
/System/Library/Frameworks/CoreTelephony.framework/CoreTelephony (compatibility version 1.0.0, current version 0.0.0)
@@ -33,9 +33,9 @@
/System/Library/Frameworks/Security.framework/Security (compatibility version 1.0.0, current version 60420.40.34)
/System/Library/Frameworks/SystemConfiguration.framework/SystemConfiguration (compatibility version 1.0.0, current version 1241.40.2)
/System/Library/Frameworks/UIKit.framework/UIKit (compatibility version 1.0.0, current version 6109.1.108)
- /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0, weak)
+ /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0)
/System/Library/Frameworks/VideoToolbox.framework/VideoToolbox (compatibility version 1.0.0, current version 1.0.0)
- /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9, weak)
+ /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)
/usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.32.0)
```
Both these changes are correct: WebKit is available from 8.0, UserNotifications from 10.0, and CoreML from 11.0. Instagram has a deployment target of 12.4.
Reviewed By: ebgraham
Differential Revision: D41348639
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89233
Approved by: https://github.com/malfet
Summary: In both eager and FX graph mode quantization,
`torch.ao.nn.quantizable.LSTM` is used as an observed custom module,
which is responsible for inserting its own observers. By default,
the user specifies a single QConfig for the custom module (either
through QConfigMapping or by setting the "qconfig" attribute),
and all inner ops will [inherit this
QConfig](dc00bb51b8/torch/ao/nn/quantizable/modules/rnn.py (L366-L378))
and use the same observer/fake_quantize constructors.
Today, users who wish to override this behavior must extend
`torch.ao.nn.quantizable.LSTM` and write a lot of custom code
to manually assign the QConfigs to the inner ops. This commit
alleviates this burden on the user by providing a helper function
to assign QConfigs with custom observers. An example use case of
this is providing a reference implementation for a backend kernel
that hardcodes qparams for efficiency.
Example usage:
```
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.observer import FixedQParamsObserver
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
from torch.ao.quantization.fx.custom_config import (
    PrepareCustomConfig,
    ConvertCustomConfig,
)

class MyModel(torch.nn.Module):
    ...

class UserLSTM(torch.ao.nn.quantizable.LSTM):
    @classmethod
    def from_float(cls, other):
        assert isinstance(other, cls._FLOAT_MODULE)
        linear_output_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -11, zero_point=2 ** 15, dtype=torch.qint32)
        sigmoid_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -16, zero_point=0, dtype=torch.qint32)
        tanh_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -15, zero_point=2 ** 15, dtype=torch.qint32)
        cell_state_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -11, zero_point=0, dtype=torch.qint32)
        hidden_state_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -7, zero_point=2 ** 7, dtype=torch.quint8)
        return torch.ao.quantization.utils._get_lstm_with_individually_observed_parts(
            float_lstm=other,
            linear_output_obs_ctr=linear_output_obs_ctr,
            sigmoid_obs_ctr=sigmoid_obs_ctr,
            tanh_obs_ctr=tanh_obs_ctr,
            cell_state_obs_ctr=cell_state_obs_ctr,
            hidden_state_obs_ctr=hidden_state_obs_ctr,
        )

qconfig_mapping = get_default_qconfig_mapping()
example_inputs = (torch.rand(5, 3, 50), torch.rand(1, 3, 50), torch.randn(1, 3, 50))
prepare_custom_config = PrepareCustomConfig() \
    .set_float_to_observed_mapping(torch.nn.LSTM, UserLSTM)
convert_custom_config = ConvertCustomConfig() \
    .set_observed_to_quantized_mapping(UserLSTM, torch.ao.nn.quantized.LSTM)
model = MyModel()
model = prepare_fx(model, qconfig_mapping, example_inputs, prepare_custom_config=prepare_custom_config)
model(*example_inputs)  # calibrate
model = convert_fx(model, convert_custom_config=convert_custom_config)
model(*example_inputs)
```
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88456
Approved by: https://github.com/jerryzh168, https://github.com/vkuzo
We add most in-place references in a generic way. We also implement a
wrapper to implement the annoying interface that `nn.functional`
nonlinearities have.
Along the way, we fix a couple of decompositions for some non-linearities by
extending the arguments that the references accept.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88117
Approved by: https://github.com/mruberry
The previous behaviour would call `resize_` on 0-sized elements even
when their size was correct. This would make some tests fail, as resize_
may be an in-place operation and it's not supported by some subsystems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88116
Approved by: https://github.com/mruberry
The `__name__` field of some binary reference functions was wrong. We
fix this to be consistent with unary reference functions. In the future,
we should probably make the binary reference wrapper return a wrapper
itself to avoid all those calls to `partial`.
This change helps perform a homogeneous treatment of functions by their name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88115
Approved by: https://github.com/mruberry
Summary: Tests are failing due to code packaged with trained models calling now-defunct function names (is_activation_post_process).
This diff maintains BC temporarily until the cached code can be refreshed.
Test Plan: no functional change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89260
Approved by: https://github.com/jerryzh168
Summary: The exposed op should be qparams; since there are concerns about prims not being supported, make the q and dq ops take in tensors.
Test Plan: unit test
Differential Revision: D41382580
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89236
Approved by: https://github.com/jerryzh168
This readme was deleted here: https://github.com/pytorch/pytorch/pull/73224. I chatted with the author, who doesn't remember exactly why it was deleted but suspects it was due either to out-of-date contents or to the upcoming migration to GitHub Actions.
With that said, we have references to this readme throughout our circleci directory, and since we do still have a lot of CircleCI workflows, I feel this readme still adds a lot of value. (I recently did some CI tasks that required me to dig this readme up in order to solve a problem.)
I recommend we restore this file, with a warning that its contents may be out of date, until our CircleCI workflows are entirely migrated to GitHub Actions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85598
Approved by: https://github.com/clee2000, https://github.com/malfet
## Summary ⚡
**Aim**: Add support for aten::median for MPS backend (Fixes#87220)
This is fresh clean PR from the previous [PR](https://github.com/pytorch/pytorch/pull/88554)
- Implementing the new median function in aten/src/ATen/native/mps/operations/ReduceOps.mm
- Adding it to aten/src/ATen/native/native_functions.yaml
- Adding it to existing test_median
### **It works like this** 🪶
median of the entire input tensor on MPS:
`torch.median(mps_inputTensor)`
median along a dim:
`torch.median(mps_inputTensor, dim=[int], keepdim=[Bool])`
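A minimal usage sketch (assumes a Mac with an MPS-capable PyTorch build; `mps_inputTensor` above is any tensor placed on the `mps` device):
```python
import torch

x = torch.randn(3, 4, device="mps")
print(torch.median(x))                                   # median of the entire tensor
values, indices = torch.median(x, dim=1, keepdim=True)   # median along dim 1
print(values.shape, indices.shape)
```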
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88807
Approved by: https://github.com/kulinseth
- This would remove the hard-coded check within `_ChildDataPipe`.
- Add `get_length_by_instance` to the parent class so that child DataPipes have a chance to report different lengths.
- Prevent an error when `__del__` is executed after the object has already been removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89216
Approved by: https://github.com/NivekT
Introduce a `_eval_no_call` method that evaluates a statement only if it
does not contain any calls (determined by examining the bytecode), thus preventing a command-injection exploit.
Added a simple unit test to check that
`torch.jit.annotations.get_signature` does not result in calling arbitrary code.
This code path exists only for Python 2 compatibility and perhaps
should simply be removed.
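A simplified sketch of the idea (not the exact implementation): compile the statement, inspect its bytecode, and refuse to evaluate if any call instruction is present.
```python
import dis

def eval_no_call(stmt, glob, loc):
    """Evaluate `stmt`, but only if its bytecode contains no call instructions."""
    bytecode = compile(stmt, "<string>", "eval")
    for insn in dis.get_instructions(bytecode):
        if "CALL" in insn.opname:
            raise RuntimeError(f"Type annotation should not contain calls, but '{stmt}' does")
    return eval(bytecode, glob, loc)

print(eval_no_call("1 + 2", {}, {}))          # fine: 3
try:
    eval_no_call("__import__('os')", {}, {})  # rejected: contains a call
except RuntimeError as e:
    print(e)
```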
Fixes https://github.com/pytorch/pytorch/issues/88868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89189
Approved by: https://github.com/suo
This reverts commit 2b131b1d43b10a2a005f3f042f920a62501e4e2d.
Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list'
This reverts commit e686b8c3ba93cb7caa314c78bf84dbd2d7df9683.
Reverted https://github.com/pytorch/pytorch/pull/89143 on behalf of https://github.com/ZainRizvi due to This seems to be causing the test_make_fx_symbolic_exhaustive_rad2deg_cpu_float32 and test_make_fx_symbolic_exhaustive_inplace_rad2deg_cpu_float32 test to fail across multiple jobs
Fixes#88939
The root cause of the issue is that BF16 cannot accurately represent big integer values. In the test case below, `539`, one of the corner pixel indices, is wrongly represented as `540` (from fc60a1865e/aten/src/ATen/native/UpSample.h (L271)), and memory is then accessed out of range with this index. Thanks to @malfet for the investigation and initial fix. I also reported https://github.com/pytorch/pytorch/issues/89212 to track the inaccurate integer representation of bf16, which needs to be addressed in other places in PyTorch.
```python
import torch
def test():
    arg_1 = torch.rand([1, 10, 540, 540], dtype=torch.bfloat16).clone()
    res = torch.nn.functional.interpolate(arg_1, 2, mode='bilinear', align_corners=True)
test()
```
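As a quick sanity check of the rounding behavior described above, independent of the interpolate repro:
```python
import torch

# bfloat16 has only 8 significant bits, so integers of this magnitude are spaced
# 4 apart; 539 rounds to the nearest representable value, 540.
print(torch.tensor(539.0, dtype=torch.bfloat16).item())  # 540.0
```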
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89210
Approved by: https://github.com/malfet
Fixes https://github.com/pytorch/torchdynamo/issues/1839
Should I do this for all backends or just inductor?
## Test
On a V100 I got from AWS
```python
from torch._dynamo import optimize
import torch
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b
new_fn = optimize("inductor")(fn)
a = new_fn(torch.Tensor(1),torch.Tensor(1))
print(a)
```
## New logs
```
(sourcetorch) ubuntu@ip-172-31-31-152:~/test$ python test.py
/home/ubuntu/pytorch/torch/_dynamo/eval_frame.py:318: UserWarning: Tensor cores are available but not enabled. Consider setting torch.backends.cuda.matmul.allow_tf32 == True in your python script for speedups
warnings.warn(
tensor([1.3717])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88844
Approved by: https://github.com/ngimel, https://github.com/mlazos, https://github.com/anijain2305
Fixes an empty-input convolution issue: when the input is empty, e.g. with shape (0, 3, 3, 4), and the weight is in channels-last format, at::_unsafe_view raises "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."
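A hypothetical repro sketch along the lines of the description (the exact shapes and conv parameters that triggered the original report may differ):
```python
import torch
import torch.nn.functional as F

x = torch.randn(0, 3, 3, 4)  # empty batch
w = torch.randn(5, 3, 1, 1).to(memory_format=torch.channels_last)
out = F.conv2d(x, w)  # previously raised the _unsafe_view error described above
print(out.shape)      # torch.Size([0, 5, 3, 4]) with the fix
```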
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521
Approved by: https://github.com/jgong5, https://github.com/malfet
Here we pass XNNExecutor* to compile model so that XNNExecutor can be allocated by runtime. This signature change is for executorch:
```
XNNExecutor compileModel(void* buffer) --> void compileModel(void* buffer, XNNExecutor* executor)
```
The intended use case for allocating the Executor and compiling the serialized flatbuffer:
```
XNNExecutor* executor = runtime_allocator->allocateList<jit::xnnpack::delegate::XNNExecutor>(1);
XNNCompiler::compileModel(processed.buffer, executor);
```
Differential Revision: [D41208387](https://our.internmc.facebook.com/intern/diff/D41208387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89090
Approved by: https://github.com/digantdesai
Analyze and upload disabled tests rerun to S3. Note that this only picks up `test-reports` from `rerun_disable_tests` workflows.
### Testing
Running the script manually `python -m tools.stats.check_disabled_tests --workflow-run-id 3473068035 --workflow-run-attempt 1 --repo pytorch/pytorch` and see the files successfully uploaded to s3://ossci-raw-job-status/rerun_disabled_tests/3473068035/1
Rockset collection created https://console.rockset.com/collections/details/commons.rerun_disabled_tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89083
Approved by: https://github.com/clee2000
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.
Originally, `None` meant disabling vectorization, while a specific value meant the number of elements to be vectorized at a time. But the proper value depends on the data type: regarding a 256-bit SVE/SIMD ISA for ARM and X86, `simdlen` should be 16 for Float but 32 for BFloat. Hence, this PR redefines `simdlen` as the SIMD bit width. The detailed semantics are as follows.
- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It falls back to the disabled semantics if the bit width does not match the ISA width.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR adds the first version of the `replicate()` composable API. For this prototype version, I try to reuse as much code from existing `DistributedDataParallel` as possible, and iterate on it in later changes. The basic idea of this prototype is:
- create a `ReplicateState` object. It internally uses a `ParameterList` module to hold all parameters of modules marked by `replicate()` API.
- create an internal `_ddp` object, which reuses existing `DistributedDataParallel` implementation, and wraps the `ParameterList` object
- install pre-forward and after-forward hooks on the root module, which calls methods of `_ddp` to run initialization and forward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87649
Approved by: https://github.com/zhaojuanmao
Summary:
I found a confusing bug in the PassManager that only happens
when you instantiate one multiple times: it will use old passes and
constraints!
This occurs because the class-level declarations initialize it to an empty list,
but the problem is that class initializers only run once, and are creating class
variables. This means the same empty list was being reused every time, except
after the first time it isn't empty.
The empty list has to be created in `__init__` newly each time or else it'll be shared.
Note that this is the same type of bug as using an empty list as a default parameter, where
it'll reuse the same list pointer and not make it empty each time.
The better way to do this is with either:
* An immutable default parameter like an empty tuple, that you create a new list from: `self.passes = list(passes)`
* Use None and then create the empty list inside `__init__`
I chose the latter as it's less likely to cause a behavior change due to the changed default.
Note that for immutable values like `False` and `1` this doesn't apply as you can't mutate that
value for everyone.
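A minimal illustration of the bug pattern and the chosen fix (hypothetical class names, not the actual PassManager code):
```python
class BuggyManager:
    passes: list = []          # class-level default: every instance shares this one list

    def add_pass(self, p):
        self.passes.append(p)  # mutates the shared class attribute

class FixedManager:
    def __init__(self, passes=None):
        # create a fresh list per instance instead of sharing a class-level one
        self.passes = list(passes) if passes is not None else []

a, b = BuggyManager(), BuggyManager()
a.add_pass("2x")
assert b.passes == ["2x"]      # surprise: b sees a's pass

c, d = FixedManager(), FixedManager()
c.passes.append("3x")
assert d.passes == []          # independent state
```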
Test Plan:
Added a test to ensure that the pass state is not saved.
Without my change, this test would fail as it would run all of the `2 * x` passes first,
then all of the `3 * x` passes.
Differential Revision: D41327056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89108
Approved by: https://github.com/angelayi
The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts.
So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```
In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.
However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done).
This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.
The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
This is a pretty much self-explanatory issue.
Two typos in the generate-binary script caused workflows to be generated with invalid parameters:
1. .generated-linux-binary-libtorch-pre-cxx11-master.yml
2. .generated-macos-arm64-binary-wheel-nightly.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89153
Approved by: https://github.com/malfet
Mainly wanted to confirm torchrun works fine with dynamo/ddp,
but it is also a better system than manually launching processes.
Partially addresses issue #1779
New run commands
------------
single process:
python benchmarks/dynamo/distributed.py [args]
multi-gpu (e.g. 2 gpu on one host):
torchrun --nproc_per_node 2 benchmarks/dynamo/distributed.py [args]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89149
Approved by: https://github.com/aazzolini
A number of dashboard improvements:
- Add accuracy failures to warnings section
- Add regression detection to all metrics (speedup, compile time, peak memory), not just accuracy
- Add testing flag to update-dashboard to prevent image/comment uploads
- Add section for comparing summary statistics (passrate, speedup) between 2 most recent reports
- Show names of reports for summary stats diff and regression detection sections
- Remove metric graphs from the comment (they can still be found in the generated text file)
Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1317565972
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89155
Approved by: https://github.com/anijain2305
Summary: The same function existed in observer and quantize; it is consolidated into a
single function here. Note that the definitions were slightly different; I've
changed the definition to be maximally inclusive so that the name of the
function is more accurate.
Test Plan: python test/test_public_bindings.py
python test/test_quantization.py
Differential Revision: [D40709276](https://our.internmc.facebook.com/intern/diff/D40709276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87520
Approved by: https://github.com/jcaip
Summary: When the BackendConfig was first introduced,
`overwrite_output_observer` and `overwrite_output_fake_quantize`
were added to ensure fixed qparams ops like `torch.nn.Sigmoid`
and `torch.nn.Tanh` used the correct observers and fake quantizes.
However, this is hacky because the BackendConfig should not set
the observer constructors themselves, but should instead specify
only requirements on the observers.
Later, https://github.com/pytorch/pytorch/pull/80184 added the
correct observers to `get_default_qconfig_mapping` along with
validation logic that throws an error if incorrect observers
were specified. With this change, we no longer need to overwrite
the observers from the BackendConfig, since we expect the user to
pass in the correct observers for these ops.
This commit removes these overwrite observer settings in the
BackendConfig. Instead, we represent the observer constraints for
fixed qparams ops through the existing DTypeWithConstraints
mechanism. Note that, however, to be consistent with other
DTypeWithConstraints checks, we no longer throw an error if an
incorrect observer is specified, but simply ignore the offending
QConfig and log a warning instead. This is the BC-breaking part
of the change.
BC-breaking notes:
```
from torch.ao.quantization import QConfigMapping
from torch.ao.quantization.qconfig import default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx
model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```
Before this commit, running the above leads to an exception
because the wrong observers are used for fixed qparams ops.
After this commit, the above will only encounter a warning,
and the fixed qparams ops will not be quantized. In both cases,
switching to `get_default_qconfig_mapping` will cause the
fixed qparams ops to be quantized.
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88620
Approved by: https://github.com/jerryzh168
The logic used by `mem_leak_check` https://github.com/pytorch/pytorch/pull/88373 is currently not applied to ROCm, i.e. 06486cd008, because its workflows don't have the test-config filtering logic yet (Linux, Mac, and Windows all have it already). In other words, ROCm tests always run with mem leak check disabled at the moment. We want that, but we also want to run the tests with mem leak check enabled once per day. This PR closes that gap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89046
Approved by: https://github.com/clee2000
This is an interesting one
Since this is an operation that's intrinsically defined on the reals,
we should perform the ops on that dtype always, and just cast to
the desired dtype at the end. This simplifies the decomposition.
Now, I started looking at this one when I started seeing failures on a
test that's added in a later PR. What's going on here is that, by doing
an upcast to a higher dtype and then cast down to integers, sometimes
there's an off-by-one error. I think this is fine, as the decomposition
is more accurate than the original function, which goes in line with
the whole PrimTorch effort.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87203
Approved by: https://github.com/mruberry
Summary: The CPU block in `collective_post` was missing pre- & post-processing. The reduce-scatter implementation expects the pre-processing callback to flatten the input tensors; the missing invocation meant garbage values were being passed.
Test Plan: Tested the reduce-scatter collective using PARAM
Reviewed By: eastzone
Differential Revision: D41291592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89030
Approved by: https://github.com/kingchc, https://github.com/kwen2501
Extend `register_custom_op` to support onnx-script local function. The FunctionProto from onnx-script is represented by custom op and inserted into ModelProto for op execution.
NOTE: I did experiments on >2GB case of a simple model with large initializers:
```python
import torch
class Net(torch.nn.Module):
    def __init__(self, B, C):
        super().__init__()
        self.layer_norm = torch.nn.LayerNorm((B, C), eps=1e-3)

    def forward(self, x):
        return self.layer_norm(x)
N, B, C = 3, 25000, 25000
model = Net(B, C)
x = torch.randn(N, B, C)
torch.onnx.export(model, x, "large_model.onnx", opset_version=12)
```
And it turns out we won't get model_bytes > 2GB after the `_export_onnx` pybind cpp function, as we split initializers into external files in that function and serialize before returning the model bytes; protobuf does not allow a model larger than 2GB under any circumstances.
The test cases can be found in the next PR #86907 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86906
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
Fake tensor behaves pretty differently depending on if you have
symbolic shapes or not. This leads to bugs; for example, we
weren't getting correct convolution_backward strides because we
bypassed the correct stride logic in fake tensor on symbolic
shapes.
This PR attempts to unify the two codepaths. I don't manage to
unify everything, but I get most of it. The algorithm is delicate
and I'm still hosing down test failures.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89038
Approved by: https://github.com/anjali411
We will need this to implement a convolution meta function that
is SymInt aware. I use templates so that regular convolution code
is not affected by the change. No tests for symbolic ints directly; that will
come in a subsequent PR which also needs to refactor fake tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89069
Approved by: https://github.com/SherlockNoMad
# Summary
Creates a callable native function that can determine which implementation of scaled dot product attention will get called. This allows us to reorder the runtime dispatch of SDP to enable autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029
Approved by: https://github.com/cpuhrsch
Now that periodic jobs run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4h mark.
* 2452e3f99a
* 35e668b5ce
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89079
Approved by: https://github.com/clee2000
This PR teaches PyDispatcher and PyOperator about functorch transforms.
It is important that PyDispatcher/PyOperator dispatch with functorch
transforms, because this is our plan for higher-order operators
(operators that accept functions as arguments). Examples of these
include:
- functorch transforms over the existing cond operator (control flow)
- autograd.Function support for functorch (which I am working towards),
- AOTDispatcher (should be a higher order operator)
Concretely, the problem with teaching PyDispatcher/PyOperator about
functorch is that the stack-based dispatching logic (DynamicLayerStack)
is hidden inside the fallbacks for two dispatch keys
(DynamicLayer{Front, Back}). PyDispatcher doesn't know about C++ boxed
fallbacks, our plan on record for that is that we need to reimplement
all of them in Python (but can call helper functions in C++ to make our
lives easier).
Instead of exposing all of what DynamicLayer{Front, Back} do to python,
this PR takes the approach of re-implementing part of the stack-based
dispatching in Python. The motivation is that this is more sane and
follows what the "ideal" implementation of functorch would have been:
- each transform should be a "mode"
- there should be no TLS dispatch key set hackery. functorch needs to do
this hackery today to re-use VariableType implementations.
This PR:
- exposes the DynamicLayerStack to Python
- The DynamicLayerStack is a stack of Interpreters.
These get exposed to Python as well.
- Interpreters can run operations (Interpreter.process) or lower them to
the next interpreter in the stack (Interpreter.lower)
- To use a PyOperator with functorch transforms, a developer needs to
register a rule for each transform (vmap, grad, jvp, ...).
- The PyOperator API is NOT user-facing. Things like autograd.Function
support for functorch will end up going through the autograd.Function
API.
Question for reviewers:
- Does this design make sense?
- I'm trying to split up the "functorch support for autograd.Function"
work into logical pieces. Would it be better if I didn't? (the full
thing is a bit long - 1000-2000 LOC).
Test Plan:
- new tests that construct PyOperator and compose them with functorch
transforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88785
Approved by: https://github.com/samdow, https://github.com/soulitzer
Sometimes it's really convenient to run simple models thru the torchbench.py script rather than those from pytorch/benchmark. This PR add the ability to run any model from a specified path by overloading the --only argument.
This PR is split out from #88904
Here is the usage:
Specify the path and class name of the model in a format like:
--only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>
Because dynamo changes the current working directory, the path should be an absolute path.
The class should have a `get_example_inputs` method that returns the inputs
for the model. An example looks like
```
class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```
Test command:
```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu eval LinearModel 0.824x p=0.00
```
Content of model_collection.py
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignores graphs with at most 1 call node.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89028
Approved by: https://github.com/jansel
A compile time guard was preventing ActivityType::CUDA from being available on rocm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time. So operators were being charged gpu time for the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) cuda times, in e.g. table().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88207
Approved by: https://github.com/malfet, https://github.com/jeffdaily
Print unexpected success as XPASS. I will submit a PR to test-infra so that the log classifier can find these
Ex: https://github.com/pytorch/pytorch/actions/runs/3466368885/jobs/5790424173
```
test_import_hipify (__main__.TestHipify) ... ok (0.000s)
test_check_onnx_broadcast (__main__.TestONNXUtils) ... ok (0.000s)
test_prepare_onnx_paddings (__main__.TestONNXUtils) ... ok (0.000s)
test_load_standalone (__main__.TestStandaloneCPPJIT) ... ok (16.512s)
======================================================================
XPASS [4.072s]: test_smoke (__main__.TestCollectEnv)
----------------------------------------------------------------------
----------------------------------------------------------------------
Ran 31 tests in 24.594s
FAILED (skipped=7, unexpected successes=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89020
Approved by: https://github.com/huydhn, https://github.com/seemethere
This fixes https://github.com/pytorch/torchdynamo/issues/1515
To fix it, we need to keep track of whether a Triton variable is a scalar (so that we do not use a mask when doing indirect loads through it). This requires a way of annotating variable names generated by CSE with properties.
So now CSE will use CSEVariable class to keep track of variables and let backends subclass it so they can annotate them with whatever information they want. TritonCSEVariable is such a subclass that track the `is_scalar` property.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347
Approved by: https://github.com/jgong5, https://github.com/ngimel
# Registers the derivative for mem efficient backward
- Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32
- I also made updates based off of Xformer main branch and flash-attention cutlass branch.
- This will enable the fused backward to be called for scaled dot product attention
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856
Approved by: https://github.com/cpuhrsch
The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc.
We could pessimistically assume that all ops mutate all of their inputs; however, inspecting schemas allows us to significantly narrow that assumption with minimal effort. Checking schemas also allows us to distinguish between dispatcher ops (which have load-bearing semantics) and user annotations with reasonably high precision.
Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854
Approved by: https://github.com/chaekit
Summary:
caffe2/test:torch_cuda - test_advanced_indexing_assignment_lazy (test_view_ops.TestViewOpsLAZY)
RuntimeError: TorchScript backend not yet supported in FBCODE/OVRSOURCE builds
File "/usr/local/fbcode/platform010/lib/python3.8/unittest/suite.py", line 163, in _handleClassSetUp
setUpClass()
File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/testing/_internal/common_device_type.py", line 506, in setUpClass
torch._lazy.ts_backend.init()
File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/_lazy/ts_backend.py", line 6, in init
torch._C._lazy_ts_backend._init()
Test Plan: Rely on CI.
Differential Revision: D41170545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88786
Approved by: https://github.com/zou3519
The python tracer caches information about module and optimizer state. That means that for subsequent calls, the presence of a Tensor in these fields does not imply that the Tensor is still live, just that it was live during the first call. (I should perhaps rename the fields to something like `stale_parameters` to convey this.) Unless we discard subsequent calls, ID assignment gets tripped up when it sees a Tensor that was already released.
Differential Revision: [D41226827](https://our.internmc.facebook.com/intern/diff/D41226827/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88917
Approved by: https://github.com/chaekit
[This run](https://github.com/pytorch/pytorch/actions/runs/3432340660/jobs/5721731207) failed claiming that it couldn't detect GPUs on the runner. Inspecting the rocminfo output (higher up in logs) show that it in fact had three GPUs, but the workflow is currently setup to expect either 2 or 4 gpus.
The workflow files currently have no way of specifying whether they'll get a 2-GPU or a 4-GPU machine, so really 2 is all any test can expect to get. [This old PR](https://github.com/pytorch/pytorch/pull/72142/files) shows that historically ROCm runners only had 4 GPUs; later the logic was extended to expect 2-GPU runners as well.
It's not clear how the ROCm runner ended up with 3 gpus instead of 2 or 4 (something for ROCm folks to look into) but there doesn't seem to be a good reason for ROCm workflows to fail if 3 (or 5) gpus ever show up on a machine. This PR makes the workflows resilient to ROCm having these alternate GPU counts
Also filed https://github.com/pytorch/pytorch/issues/89012 against the ROCm team to explore why the runner only had 3 gpus
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89011
Approved by: https://github.com/huydhn
As title, add three things to the schema
1. debug handle for each node
2. file identifier, so we can sanity check we are getting the xnnpack schema flatbuffers file, instead of other random binary
3. extension, so the dumped binary will end up with its own extension like `myschema.xnnpack` (maybe can have a better name) instead of the default extension `.bin`
Differential Revision: [D40906970](https://our.internmc.facebook.com/intern/diff/D40906970/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89033
Approved by: https://github.com/mcr229
Bug:
Previously, `initOutputLayouts()` was called after creating a graph and before merging other nodes. It is a vector with one element, so when a graph contains multiple outputs, e.g. when using AOTAutograd compile in my case, the layout_propagation pass tries to access out-of-range elements of the vector. This leads to the second bug, in `useOpaqueLayout()`: the out-of-range check compares the index against the updated output size instead of the size of the vector, and then uses `[]` to access the element, which is out of range.
Fixes for the above two issues:
1. Check that the offset is within range against the size of the `attr::output_layouts` vector instead of another variable. This check now catches the error.
2. Initialize `attr::output_layouts` after node merging. The graph may change during node merging, so the initialization was moved into layout_propagation, where the complete graph is available.
Added test time:
`Ran 1 test in 0.383s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88496
Approved by: https://github.com/jgong5, https://github.com/sanchitintel
Rerun all disabled tests to gather their latest results so that we can close disabled tickets automatically. When running under this mode (RERUN_DISABLED_TESTS=true), only disabled tests are run while the rest are skipped: `<skipped message="Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run" type="skip"/>`
The logic is roughly as follows, the test runs multiple times (n=50)
* If the disabled test passes, and it's flaky, do nothing because it's still flaky. In the test report, we'll see the test passes with the following skipped message:
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
<skipped message="{"flaky": True, "num_red": 4, "num_green": 0, "max_num_retries": 3, "rerun_disabled_test": true}" type="skip"/>
</testcase>
```
* If the disabled test passes every single time, and it is not flaky anymore, mark it so that it can be closed later. We will see the test runs and passes, i.e.
```
<testcase classname="TestCommonCUDA" name="test_out_warning_linalg_lu_factor_cuda" time="0.170" file="test_ops.py" />
```
* If the disabled test fails after all retries, this is also expected. So only report this but don't fail the job (because we don't care about red signals here), we'll see the test is skipped (without the `flaky` field), i.e.
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
<skipped message="{"num_red": 4, "num_green": 0, "max_num_retries": 3, "rerun_disabled_test": true}" type="skip"/>
</testcase>
```
This runs at the same schedule as `mem_leak_check` (daily). The change to update test stats, and (potentially) grouping on HUD will come in separated PRs.
### Testing
* pull https://github.com/pytorch/pytorch/actions/runs/3447434434
* trunk https://github.com/pytorch/pytorch/actions/runs/3447434928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88646
Approved by: https://github.com/clee2000
Fixes#80441
The highlighting in the documentation for torch.linalg.lstsq was incorrect due to a newline that sphinx doesn't parse correctly. Instead of writing the tensors directly, I used randn to generate the tensors. This seems to be more consistent with how other documentation is written.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89013
Approved by: https://github.com/lezcano
Fixes error from 7k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_arashwan_matrixnet.py
Error:
```
AssertionError: torch.* op returned non-Tensor bool call_function <function is_tensor at 0x7fca94d0faf0>
from user code:
File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 749, in scatter
return scatter_map(inputs)
File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 741, in scatter_map
assert not torch.is_tensor(obj), 'Tensors not supported in scatter.'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88704
Approved by: https://github.com/jansel
This updates `wrap_pybind_function` to use `invoke` and adds the
`invoke_traits` object which is analogous to `function_traits` but
for member functions it includes the class as an explicit argument.
To test this is working properly, I've also applied it to the
`CUDAGraph` binding code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88932
Approved by: https://github.com/albanD
Fix dashboard comment failure due to the following trace:
```
Traceback (most recent call last):
File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1180, in <module>
DashboardUpdater(args).update()
File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1119, in update
self.comment_on_gh(comment)
File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1096, in comment_on_gh
subprocess.check_call(
File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 368, in check_call
retcode = call(*popenargs, **kwargs)
File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 349, in call
with Popen(*popenargs, **kwargs) as p:
File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 1821, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/data/home/anijain/miniconda/bin/gh'
srun: error: a100-st-p4d24xlarge-27: task 0: Exited with exit code 1
```
That is, we were trying to execute a gh command in the OS that was too long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89002
Approved by: https://github.com/davidberard98
Inductor doesn't have a prims.squeeze lowering, so this breaks it. Longer term, `squeeze` with multiple dimensions is not a prim: nvfuser implements it with a loop, and inductor uses a `_squeeze_multiple` helper which turns it into a loop. The prim should accept only a single dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88927
Approved by: https://github.com/eellison
Hello @wconstab! As you saw, `transformer_auto_wrap_policy()` is a misnomer and actually works for any module classes. The PR before this one tries to add a class `ModuleWrapPolicy` that takes in the `module_classes` in its constructor and works just like `transformer_auto_wrap_policy()` without requiring the `functools.partial()`. I hope you do not mind if we update the dynamo benchmarks util file with this migration.
The PR before this one might require some back and forth among FSDP devs, so I apologize for any consequent updates to this PR, which in itself is an easy change. I will request review once we know the previous PR is good to land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88453
Approved by: https://github.com/wconstab
This fixes excessing recompilation issue in tacotron2 but has few caveats - https://github.com/pytorch/torchdynamo/issues/330
For tacotron2, the repro is something like this
~~~
def inner(x):
    return torch.sin(x)

def fn(x):
    for _ in range(100):
        inner(x)
        torch._dynamo.graph_break()
    return x
~~~
The problem here is that Dynamo has guards on the TUPLE_ITERATOR_LEN whenever a graph break happens. Therefore, we keep on recompiling.
This PR checks if there is a backedge (which helps with while loops) in the presence of a graph break. If there is, Dynamo skips processing this frame. Therefore, Dynamo gets called when `inner` is called, and we compile only once.
Note that, if there was no graph break, we will unroll the original loop, and see one graph with 100 sin operations (just as before, so no changes there).
The caveat is - We are skipping the frame, so if we have something like this
~~~
def fn(x):
    for _ in range(100):
        # 1000s of lines of PyTorch code
        torch._dynamo.graph_break()
    return x
~~~
Dynamo will skip processing this frame, and might miss on the optimization.
Completely open for suggestions. Happy to re-implement if there is a better way to handle this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88857
Approved by: https://github.com/jansel, https://github.com/yanboliang
This PR fixes `FSDP.clip_grad_norm_()` for `NO_SHARD`, which previously "double-counted" each gradient `world_size`-many times.
This does not address any discrepancies between `FULL_SHARD` and DDP. (Note that the unit tests do show parity between `FULL_SHARD` and DDP when using `FSDP.clip_grad_norm_()` and `nn.utils.clip_grad_norm_()` respectively on one iteration.)
The added unit test code path tests mixing nested FSDP instances with both `FULL_SHARD` and `NO_SHARD` to ensure that the `local_sharded_norm` and `local_nonsharded_norm` computations are interoperating correctly. I want to test non-FSDP root instance in the future, but this is BC breaking since we need to make `clip_grad_norm_()` a static method, which would require a different method call syntax (`FSDP.clip_grad_norm_(root_module, ...)` vs. `root_module.clip_grad_norm_(...)`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88955
Approved by: https://github.com/zhaojuanmao
Summary: This diff added support for fusing "dq - reshape - q" to a reshape op, the op is needed in wakeword model
Test Plan: buck test executorch/exir/tests:quant_fusion_pass
Reviewed By: qihqi, JacobSzwejbka
Differential Revision: D41111069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88858
Approved by: https://github.com/JacobSzwejbka
This is an API change, so please review carefully.
With this PR, torchdynamo returns an `OptimizedModule` class object, a subclass of `torch.nn.Module`, when asked to optimize a `nn.Module` object. Most of the methods are redirected to the original `nn.Module`, which is installed as `_mod` in the `OptimizedModule`.
This is helpful for many cases
```
mod = MockModule()
opt_mod = torch._dynamo.optimize()(mod)
print(opt_mod) # Works
opt_mod = opt_mod.to(device="cuda")
print(opt_mod) # Works
opt_mod(input) # Triggers recompile if necessary, earlier we were shedding the TorchDynamo wrapper
opt_mod.parameters() # Refers to the original module
```
Topics unclear to me
* I have overridden many methods to raise NotImplementedError. A careful review of those will be good.
* hooks
* For the optimized forward, should we call torchdynamo optimization on `__call__` or `forward`
* What else to test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88629
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/msaroufim
**BC Breaking Change**
This renames `unwrapped_params` to `nonwrapped_numel`. I prefer `nonwrapped` over `unwrapped` because "unwrap" suggests that some wrapping has been undone. I prefer `numel` over `params` because that is the unit of measurement; I think we should keep "params" to refer to `nn.Parameter`s themselves.
This only breaks anything that passes `unwrapped_params` as a keyword argument, but I did not see anything that did that (except the one internal benchmark file but that does not actually depend on our `pytorch` code).
In a follow-up, I want to rename `min_num_params` to `min_nonwrapped_numel` in `size_based_auto_wrap_policy`, which is also BC breaking. Again, this is to differentiate between "params" being `nn.Parameter`s and "numel" being the unit for `param.numel()`.
**Overview**
This PR introduces `ModuleWrapPolicy` as a lightweight layer over the existing `transformer_auto_wrap_policy`. The most common auto wrapping paradigm is:
```
module_classes: Set[Type[nn.Module]] = ...
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls=module_classes,
)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
Now, users can instead write:
```
auto_wrap_policy = ModuleWrapPolicy(module_classes)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
This hides the unused arguments expected from the callable (`recurse` and `unwrapped_params`/`nonwrapped_numel`).
`ModuleWrapPolicy` inherits from an abstract base class `FSDPPolicy` that expects a `policy` property. This decouples the construction of such `FSDPPolicy` classes from their actual `policy`, which must abide by the `_recursive_wrap` interface. Any existing auto wrap policy can be rewritten as a class that inherits from `FSDPPolicy`, so this approach is fully backward compatible from a functionality perspective.
I call this base class `FSDPPolicy` to generalize over the cases where we may not want to actually perform any nested wrapping. In reality, the policy is meant for constructing `FlatParameter`s, which just happened to be induced by a nested wrapping before. Given this, I am changing the constructor argument in `fully_shard()` to simply `policy` instead of `auto_wrap_policy`.
This PR migrates usages of `transformer_auto_wrap_policy` within our unit test suite to `ModuleWrapPolicy` as much as possible.
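As a rough sketch of the idea (not the actual `ModuleWrapPolicy` implementation), such a class can simply build the `functools.partial` internally and expose it through the `policy` property:
```python
import functools
from typing import Set, Type

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class SimpleModuleWrapPolicy:  # hypothetical name; illustrative only
    def __init__(self, module_classes: Set[Type[nn.Module]]):
        # Hide the callable's unused arguments behind a partial, as described above.
        self._policy = functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls=module_classes,
        )

    @property
    def policy(self):
        return self._policy

# Usage sketch: FSDP(model, auto_wrap_policy=SimpleModuleWrapPolicy({MyBlock}).policy),
# where MyBlock is whatever transformer layer class the model uses (hypothetical).
```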
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88450
Approved by: https://github.com/zhaojuanmao
I'm not sure why I thought this assert was valid in the first
place, and there's no comment about it.
The assert is tantamount to saying, "no tensor objects should
become dead via SafePyObject when hermetic mode is on." But
suppose we run a Python GC while we're inside hermetic mode.
This could result in us disposing non-hermetic tensors, which
would hit decref. So the assert seems invalid.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88885
Approved by: https://github.com/anjali411, https://github.com/malfet
Dynamo+AotAutograd needs a way to wrap all tensors (whether
inputs or params/buffers) in FakeTensor wrappers, and
FSDP's mangling of parameters hides them from this wrapping.
This PR unblocks running hf_bert and hf_T5 with FSDP under dynamo, whether using recursive wrapping around transformer layers or only applying FSDP around the whole model. Perf/memory validation and possibly optimization is the next step.
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager --fsdp_wrap`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager --fsdp_wrap`
The problem:
Dynamo (actually AOTAutograd) trips up with FSDP because it must wrap all input tensors in FakeTensor wrappers, and it only knows to wrap graph inputs or named_(parameters, buffers). FSDP's pre-forward hook sets views (which are not nn.Parameters) into the FlatParameter as attrs on the module with the same name as the original param, but they will not show up in named_parameters.
- in use_orig_params mode, FSDP still de-registers
params during pre-forward hook, then re-registers them
post-forward
- during forward (between the hooks), the params are setattr'd
on the module as regular view tensors, not nn.Parameters
- note: use_orig_params is the recommended way to use FSDP, and use_orig_params=False is being deprecated, so I only consider use_orig_params=True for this enablement
The solution:
- adding them to named_buffers is not possible because it interferes
with how FSDP's `_apply` works
- since they are not actual nn.parameters, register_parameter will
complain about registering them
- simply setting `module._parameters[name] = view` seems to be a viable workaround, despite being hacky, and FSDP code does modify `_parameters` directly already (see the sketch after this list)
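A toy illustration of that workaround outside of FSDP (hypothetical standalone example, not the hook code itself):
```python
import torch

module = torch.nn.Linear(4, 4)
flat = torch.randn(4 * 4)                    # stand-in for (part of) a FlatParameter
weight_view = flat.view(4, 4)                # a plain view, not an nn.Parameter

del module._parameters["weight"]             # de-register the original parameter
module._parameters["weight"] = weight_view   # hacky: expose the view as a "parameter"

# named_parameters() now yields the view under the original name again.
print("weight" in dict(module.named_parameters()))  # True
```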
Note: Manual checkpointing still isn't working with FSDP+dynamo,
so that will have to be addressed in a follow up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88781
Approved by: https://github.com/ezyang, https://github.com/awgu
This comes up if you use inplace operators on a slice, e.g.
```python
import torch
a = torch.rand(1000000, device="cuda")
a[::2] *= 2
```
The last line looks as if it should be fully inplace, but is actually
equivalent to:
```python
tmp = a[::2]
tmp *= 2
a[::2] = tmp
```
Which results in `mul_` and `copy_` being called. With this PR, the
redundant copy becomes a no-op and the above example is 2x faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88884
Approved by: https://github.com/ngimel
Summary:
Usage of fast math in the BatchBoxCox kernel produced different math results between the dev and optimized versions, which caused a few internal tests to fail.
For now, disable the compiler-optimized version and rely on ATen vectors.
Differential Revision: D41211784
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88875
Approved by: https://github.com/hyuen
**What**
This PR completely removes the `FullyShardedDataParallel` dependency from `_state_dict_utils` -- `_state_dict_utils` now depends only on `_FSDPState` and all the utils modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88637
Approved by: https://github.com/awgu
**What**
`_summon_full_params` is required for state_dict. To enable composable FSDP state_dict, `_summon_full_params` must be accessible without `FullyShardedDataParallel`. This PR moves the core logic of `_summon_full_params` to `_unshard_params_utils`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88636
Approved by: https://github.com/awgu
Follow-up for #87735
Once again, because BUILD_CAFFE2=0 is not tested for the ONNX exporter, one scenario slipped through: a use case where the model can be exported without ATen fallback when operator_export_type=ONNX_ATEN_FALLBACK and BUILD_CAFFE2=0.
A new unit test has been added, but it won't prevent regressions if BUILD_CAFFE2=0 is not executed on CI again
Fixes#87313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88504
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
**What This PR Does**
`_state_dict_utils` currently accesses the FSDP states through `module`. To enable composable FSDP state_dict, these accesses need to go through `_FSDPState`. `module` is still required for most APIs, as state_dict has to access per-module information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88635
Approved by: https://github.com/awgu
WIP to fix the extremely slow `scatter_add` issue vs. fp16. The current changes seem to improve performance, but it still appears to lag behind the fp16 equivalent.
CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84981
Approved by: https://github.com/ngimel
In `FakeTensorMode.__torch_dispatch__`, the output is not always computed by meta kernels in
```python
try:
    with in_kernel_invocation_manager(self):
        r = func(*args, **kwargs)  # <----- "r" can be a real tensor.
except NotImplementedError as not_implemented_error:
    # no meta kernel registered, fallback to kernel for the device
    if not self.allow_fallback_kernels:
        raise not_implemented_error
    return run_fallback_kernel(self, func, args, kwargs, not_implemented_error)
return self.wrap_meta_outputs_with_default_device_logic(r, func, args, kwargs)
```
For example, I observed that a CPU tensor is generated when executing `aten.addmm` while running `FakeTensorProp`. Therefore, I'd like to allow `FakeTensorMode` to wrap real tensors as `FakeTensor` during the computation. Does this PR look like a good direction to fix this problem? If yes, I can go ahead and add some tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88700
Approved by: https://github.com/eellison, https://github.com/ezyang
This is one step toward the ultimate goal: remove the overwritten state_dict in FSDP. All the logic should be either in `pre_state_dict_hook` or `post_state_dict_hook`.
Since the current `nn.Module` does not support `pre_state_dict_hook`, this PR mimics `pre_state_dict_hook` by calling the pre hook inside the post hook, effectively ditching all the work done by `nn.Module.state_dict`. Once `pre_state_dict_hook` is supported by `nn.Module`, these pre hook calls can be moved out of the post hooks and be registered to `nn.Module.pre_state_dict_hook`.
The major issue of this temporary solution is that `post_state_dict_hook` is called from the leaf node to the root node. This makes calling `module._lazy_init()` there invalid, as FSDP assumes `_lazy_init()` is called from the root. As a result, `FSDP.state_dict` currently contains only one piece of logic -- calling `module._lazy_init()`.
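Schematically, the workaround looks something like this (a simplified sketch using the private `_register_state_dict_hook` API on a plain module, not FSDP's actual hooks):
```python
import torch

def _pre_state_dict_hook(module, prefix):
    # placeholder for the "pre" work (e.g. lazy init, unsharding parameters)
    pass

def _post_state_dict_hook(module, state_dict, prefix, local_metadata):
    _pre_state_dict_hook(module, prefix)  # mimic the missing pre-hook here
    # ... post-process the state_dict entries for this module in place ...

module = torch.nn.Linear(2, 2)
module._register_state_dict_hook(_post_state_dict_hook)
print(list(module.state_dict().keys()))  # ['weight', 'bias']
```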
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87900
Approved by: https://github.com/rohan-varma
Since we already enforce eval mode for the fast_path, we do not need to also check for a falsy dropout value, as a model trained with dropout will have a non-zero dropout during eval mode, even though it won't be applied.
Fixes#88806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88831
Approved by: https://github.com/drisspg
Summary: See title. Left Windows out so it still compiles.
Test Plan:
Add a `#fail` below [this line](https://fburl.com/code/p0mlhlw4) and build for various platforms and confirm it fails which proves the `#ifdef` was hit.
```
buck2 build xplat/langtech/tuna/cli:tuclixAndroid
buck2 build xplat/langtech/tuna/cli:tuclix
```
CI/CD for the rest.
Differential Revision: D41054824
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88545
Approved by: https://github.com/qihqi
* Reflect required arguments in the method signature for each diagnostic rule. The previous design accepted an arbitrarily sized tuple, which is hard to use and error-prone.

* Removed `DiagnosticTool` to keep things compact.
* Removed specifying supported rule set for tool(context) and checking if rule of reported diagnostic falls inside the set, to keep things compact.
* Initial overview markdown file.
* Change the `full_description` definition. Now the `text` field should not be empty, and its markdown should be stored in the `markdown` field.
* Change `message_default_template` to allow only named fields (excluding numeric fields). `field_name` provides clarity on what argument is expected.
* Added `diagnose` api to `torch.onnx._internal.diagnostics`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87830
Approved by: https://github.com/abock
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long
Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
This is the on-device runtime work. We move compile and execute from our earlier hacky solution to what will actually be running at runtime.
First we rebuild our graph from the serialized flatbuffer string. We also introduce a runtime wrapper that inherits from CustomClassHolder, which allows us to forward the built xnngraph runtime along to our execute function.
Once the subgraph object has been rebuilt, we pass it along to the runtime wrapper, which forwards it to execute.
At execute we prep the inputs/outputs and invoke the runtime using our runtime wrapper. Finally we forward those results along as the output of execution.
Differential Revision: [D39413031](https://our.internmc.facebook.com/intern/diff/D39413031/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39413031/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88780
Approved by: https://github.com/digantdesai
If someone is building the project from source, they're likely a contributor, for whom `develop` will be much more useful. People who want to try the latest and greatest can leverage the nightlies.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88507
Approved by: https://github.com/malfet
# Executor Class
Executor object used to wrap our xnn_runtime object. The ideal flow of this object looks like this:
```
executor.set_inputs(vector<tensor> inputs, vector<tensor> outputs)
executor.forward()
```
This will likely be returned by our delegate compile and given over to execute in order to run inference using the xnn runtime
##### Executorch Considerations
```
#include <ATen/Functions.h>
#include <ATen/Utils.h>
```
These ATen functions are included in order to use at::Tensor when setting the inputs; this will change when used for Executorch because we will be switching from at::Tensor to whatever tensor abstraction is used for ET. It seems like they have the same call for `.data_ptr<float>()`, so realistically all logic here will be the same.
ATen/Utils is used for TORCH_CHECK. We will switch to ET_CHECK_MESSAGE for Executorch.
Differential Revision: [D40733121](https://our.internmc.facebook.com/intern/diff/D40733121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88778
Approved by: https://github.com/digantdesai
Summary:
Since `c10::ArrayRef` now supports `c10::ArrayRef<const T>`, let's restore `ComputePostOrder` to accept `const Node*` again, which is more suitable for the context of the given helpers.
Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88773
Approved by: https://github.com/JackCaoG
- Calling `F.pad()` issues a pad kernel from the CPU even if there is no padding needed, which can incur some non-negligible overhead. This PR removes that unnecessary call for the no-padding case (see the sketch after this list).
- This PR also does not zero the newly-allocated sharded gradient tensor before the reduce-scatter if `use_orig_params=True` because there is no need. The reduce-scatter will fill the tensor anyway, and we do not care about the values in the padding. For `use_orig_params=False`, the padding is exposed to the user, so we preserve the existing semantics of zeroing it. I left a to-do to follow up since we may optimize that.
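The first point amounts to guarding the call, roughly like this (hypothetical helper, not the FSDP code itself):
```python
import torch
import torch.nn.functional as F

def pad_if_needed(tensor: torch.Tensor, numel_to_pad: int) -> torch.Tensor:
    if numel_to_pad == 0:
        return tensor  # skip the needless pad kernel launch
    return F.pad(tensor, [0, numel_to_pad])

x = torch.randn(8)
assert pad_if_needed(x, 0) is x            # no kernel issued
assert pad_if_needed(x, 4).numel() == 12   # padded only when actually needed
```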
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88769
Approved by: https://github.com/zhaojuanmao
As of LLVM 15 typed pointers are going away:
https://llvm.org/docs/OpaquePointers.html. Thus
`getPointerElementType` is no longer legal, since pointers are all
opaque. I don't totally remember why we use it so prolifically, or
whether there's an easy change to get rid of it, or whether we'd need
a significant refactor to carry around `Type`s alongside `Value`s.
But in any case, NNC is deprecated (see: TorchInductor) and will
hopefully be gone before LLVM 16 is a thing. For now, we can apply
the hack of turning off opaque pointer mode on the LLVMContext.
Differential Revision: [D41176215](https://our.internmc.facebook.com/intern/diff/D41176215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88798
Approved by: https://github.com/desertfire
Hybrid sparse CSR tensors currently cannot be compared to strided ones since `.to_dense` does not work:
```py
import torch
from torch.testing._internal.common_utils import TestCase
assertEqual = TestCase().assertEqual
actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]])
expected = torch.stack([actual[0].to_dense(), actual[1].to_dense()])
assertEqual(actual, expected)
```
```
main.py:4: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:54.)
actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]])
Traceback (most recent call last):
File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1098, in assert_equal
pair.compare()
File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 619, in compare
actual, expected = self._equalize_attributes(actual, expected)
File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 706, in _equalize_attributes
actual = actual.to_dense() if actual.layout != torch.strided else actual
RuntimeError: sparse_compressed_to_dense: Hybrid tensors are not supported
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 10, in <module>
assertEqual(actual, expected)
File "/home/philip/git/pytorch/torch/torch/testing/_internal/common_utils.py", line 2503, in assertEqual
msg=(lambda generated_msg: f"{generated_msg}\n{msg}") if isinstance(msg, str) and self.longMessage else msg,
File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1112, in assert_equal
) from error
RuntimeError: Comparing
TensorOrArrayPair(
id=(),
actual=tensor(crow_indices=tensor([0, 2, 4]),
col_indices=tensor([0, 1, 0, 1]),
values=tensor([[ 1, 11],
[ 2, 12],
[ 3, 13],
[ 4, 14]]), size=(2, 2, 2), nnz=4,
layout=torch.sparse_csr),
expected=tensor([[[ 1, 11],
[ 2, 12]],
[[ 3, 13],
[ 4, 14]]]),
rtol=0.0,
atol=0.0,
equal_nan=True,
check_device=False,
check_dtype=True,
check_layout=False,
check_stride=False,
check_is_coalesced=False,
)
resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.
```
This adds a temporary hack to `TestCase.assertEqual` to enable this. Basically, we go through the individual CSR subtensors, call `.to_dense()` on them, and stack everything back together. I opted not to do this in the common machinery, since that way users are not affected by this (undocumented) hack.
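The gist of the hack is a small helper along these lines (a simplified sketch mirroring the construction of `expected` above, not the exact test-suite code):
```python
import torch

def hybrid_csr_to_dense(t: torch.Tensor) -> torch.Tensor:
    # Densify each CSR sub-tensor separately, then stack them back together.
    return torch.stack([t[i].to_dense() for i in range(t.size(0))])

actual = torch.sparse_csr_tensor(
    [0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12], [3, 13], [4, 14]]
)
print(hybrid_csr_to_dense(actual).shape)  # torch.Size([2, 2, 2])
```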
I also added an xfailed test that will trigger as soon as the behavior is supported natively so we don't forget to remove the hack when it is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88749
Approved by: https://github.com/mruberry, https://github.com/pearu
Fixes T135842750 (follow-up for #87377)
## Description
At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.
This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.
Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask.
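For intuition, the merge described above boils down to broadcasting the two masks into a single 4D additive mask. A rough sketch (shapes follow the usual MHA convention; the merge logic here is illustrative, not the exact nn.Transformer code):
```python
import torch

N, num_heads, S = 2, 4, 5
src_mask = torch.zeros(S, S)                                # additive float mask
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[0, -1] = True                          # pad out last token of sample 0

pad = torch.zeros(N, 1, 1, S).masked_fill(
    src_key_padding_mask.view(N, 1, 1, S), float("-inf")
)
merged = (src_mask.view(1, 1, S, S) + pad).expand(N, num_heads, S, S)
print(merged.shape)  # torch.Size([2, 4, 5, 5])
```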
## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
Add absolute latency to dashboard, as requested by https://github.com/pytorch/torchdynamo/issues/1833#issuecomment-1302742914
Tested by setting `run.sh` to
```
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-7/
mkdir ../test-dynamo-runner-logs-7/
# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 --cold_start_latency
# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor --no-skip --dashboard --only mobilenet_v2
```
and running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-7/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard` (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).
Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1309645562
NOTE: this change breaks processing old logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88790
Approved by: https://github.com/anijain2305
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.
Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
Pass install channel when building nightly images
Pass `TRITON_VERSION` argument to install triton for nightly images
Fix `generate_pytorch_version.py` to work with unannotated tags and avoid failures like the following:
```
% git checkout nightly
% ./.github/scripts/generate_pytorch_version.py
fatal: No annotated tags can describe '93f15b1b54ca5fb4a7ca9c21a813b4b86ebaeafa'.
However, there were unannotated tags: try --tags.
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 120, in <module>
main()
File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 115, in main
print(version_obj.get_release_version())
File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 75, in get_release_version
if not get_tag():
File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 37, in get_tag
dirty_tag = subprocess.check_output(
File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'describe']' returned non-zero exit status 128.
```
After the change, nightly is reported as (due to an autolabelling issue, which should be fixed by https://github.com/pytorch/test-infra/pull/1047):
```
% ./.github/scripts/generate_pytorch_version.py
ciflow/inductor/26921+cpu
```
Even for tagged release commits version generation was wrong:
```
% git checkout release/1.13
% ./.github/scripts/generate_pytorch_version.py
ciflow/periodic/79617-4848-g7c98e70d44+cpu
```
After the fix, it is as expected:
```
% ./.github/scripts/generate_pytorch_version.py
1.13.0+cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88741
Approved by: https://github.com/dagitses, https://github.com/msaroufim
This diff adds the option to use a Buffer to store data for a `vTensor` by passing `StorageType::BUFFER` to the constructor of `vTensor`. To enable this change, the construction of `vTensor` and `vTensorStorage` had to be slightly refactored to properly support strides. To summarize the changes:
* `vTensorStorage` now contains no Tensor metadata (such as tensor sizes, strides, and `TensorOptions`) - it now only contains the image extents (if texture storage is used) and the buffer length. Tensor metadata is now managed by `vTensor`. The reason for this is to allow multiple `vTensor` objects to point to the same `vTensorStorage` but with different metadata which may be a useful feature now that Buffer storage is enabled.
* `vTensor` will now compute the strides upon construction based on the requested sizes and memory layout if Buffer storage is requested. Previously, strides were faked by setting them all to 0, as strides do not apply to image textures (this behavior is preserved for texture storage); see the small sketch below.
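The stride computation itself is the usual contiguous-stride recurrence. A small sketch (generic arithmetic for a plain contiguous layout, not the vTensor code):
```python
from typing import List

def contiguous_strides(sizes: List[int]) -> List[int]:
    # innermost dimension has stride 1; each outer stride is the product of the inner sizes
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides

print(contiguous_strides([2, 3, 4]))  # [12, 4, 1]
```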
Differential Revision: [D40604163](https://our.internmc.facebook.com/intern/diff/D40604163/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87622
Approved by: https://github.com/digantdesai
- DLRM requires special configuration of embedding layers which are sparse
and not compatible with DDP.
- I could mark the embedding params as ignored in DDP
to make the benchmark pass, but this isn't a representative benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88712
Approved by: https://github.com/ezyang
Fixes#81690
TODO:
* [x] C++ Unpickler Fix (locally tested pickled in Python and unpickled in C++)
* [x] C++ Pickler Fix (locally tested pickled in C++ and unpickled in Python)
* [x] Do quant_tensor, sparse_tensor, etc require similar changes? (Sparse and Quant don't need this)
* [x] Add Comments
* [x] How to make sure C++ and Python are in sync? (Functions in `pickler.h` help in getting and setting Tensor Metadata (math-bits for now) on a tensor. They are the only place which should handle this.)
Notes:
Quant Tensor don't support complex dtypes and for float they segfault with `_neg_view` : https://github.com/pytorch/pytorch/issues/88484
Sparse Tensor:
```python
>>> a = torch.tensor([[0, 2.], [3j, 0]]).to_sparse()
>>> a.conj().is_conj()
False
>>> a._neg_view()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NotImplementedError: Cannot access storage of SparseTensorImpl
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88182
Approved by: https://github.com/ezyang, https://github.com/anjali411
Summary: Currently the fallback kernel in inductor assumes its output is
either a tensor or a tuple/list of tensors. This PR makes it handle more
generic output data structures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495
Approved by: https://github.com/jansel
Summary:
This test relies on what the root workspace is before any other code is run. However, some of the test cases change it. If the order in which the tests are run is randomized, then the test can fail if run after one of them.
Having it on its own ensures that it always sees a pristine state.
Test Plan:
Verified locally and confirmed in internal and external CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88682
Approved by: https://github.com/r-barnes, https://github.com/malfet
## Description
Support lowering of channel shuffle in FX by adding its module and functional op to `is_copy_node` list in `torch/ao/quantization/fx/_lower_to_native_backend.py`
## Validation
UTs added to test
- correctness of quantized `ChannelShuffle` module.
- FX lowering of `ChannelShuffle` module and functional `channel_shuffle`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83731
Approved by: https://github.com/jerryzh168
It allows one to SSH in faster rather than having to wait for the repo clone to finish.
I.e. right now one usually has to wait for a few minutes before the PyTorch clone is finished, but with this change you can SSH in ahead of time (thanks to `setup-ssh` being a composite action).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88715
Approved by: https://github.com/clee2000, https://github.com/izaitsevfb
Summary:
Today when we transform the captured graph in the last step in export(aten_graph=True), we construct a new graph which doesn't have all the metadata preserved, for example node.meta["val"].
meta["val"] is important for writing passes and analysis on the graph later in the pipeline, we may want to preserve that on placeholder nodes.
Test Plan: test_export.py:test_export_meta_val
Differential Revision: D41110864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88651
Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel
Summary: This diff modifies the implementation of the select operator so slices of the irregular dimension can be selected (e.g. nt[:,0,:]).
Test Plan:
Added new unit tests to test that the new functions work as intended (see them in diff). To test,
`buck test mode/dev-nosan //caffe2/test:nested`
Differential Revision: D41083993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88585
Approved by: https://github.com/cpuhrsch
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata.
I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.
Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
Part of the current ID assignment algorithm groups any Storages which are associated with the same TensorImpl*. This isn't sound (which I knew but deferred until it actually became a problem) because pointers can be reused by different objects. (ABA problem)
ABA is easy to handle for Storage because we see allocations and frees, but ~TensorImpl is very hot and cannot tolerate profiling code without significant increases in overhead.
This PR narrows the conditions under which ID assignment will join on TensorImpl*. Two storages which are associated with the same TensorImpl* are grouped IFF they were live at the same time. (Note that this still allows storages with disjoint lifetimes to be joined transitively through a third storage which overlaps with both.)
The need for this PR arose in memory profiling. The Python argument parser creates short lived Tensors for (some) scalar arguments which triggers this issue. (Which is stochastic and platform dependent since optimizations like reusing recently freed allocations is implementation defined.) Spurious connections can lead to confusing and long range interactions when building up the memory profile, so it makes sense to harden ID assignment to avoid any issues.
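As a toy model of the tightened rule (hypothetical data layout, not the profiler's actual records), two uses of the same `TensorImpl*` are joined only when their lifetimes overlap:
```python
from typing import NamedTuple

class StorageUse(NamedTuple):
    impl_ptr: int   # observed TensorImpl* value
    alloc_t: int    # allocation timestamp
    free_t: int     # free timestamp

def should_join(a: StorageUse, b: StorageUse) -> bool:
    same_impl = a.impl_ptr == b.impl_ptr
    lifetimes_overlap = a.alloc_t < b.free_t and b.alloc_t < a.free_t
    return same_impl and lifetimes_overlap

live_together = StorageUse(0xABC, 0, 10), StorageUse(0xABC, 5, 20)
reused_pointer = StorageUse(0xABC, 0, 10), StorageUse(0xABC, 50, 60)
print(should_join(*live_together))   # True
print(should_join(*reused_pointer))  # False: same pointer, disjoint lifetimes (ABA)
```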
Differential Revision: [D40445121](https://our.internmc.facebook.com/intern/diff/D40445121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87133
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
Various code cleanup in MPS operations:
- Per @kulinseth suggestion move `mpsSupportsCumsum` to `MPSDevice.h` and rename it to
`is_macos_13_or_newer()`
- Move Ventura MPSGraph new operators to `MPSGraphVenturaOps.h` header
- Use `LookupAs` and `CreateCachedGraphAs` to make code more compact
- Formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88529
Approved by: https://github.com/kulinseth
Summary:
Implement `native_batch_norm.out` for CPU. Reuses the main logic for `native_batch_norm` but extract out the Tensor creation logic for outputs. There are 3 outputs: `output`, `save_mean` and `save_var`. `batch_norm_cpu` calls `batch_norm_cpu_update_stats_template` to get `save_mean` and `save_var`, and then calls into `batch_norm_cpu_transform_input_template` which initializes `output`.
In the implementation of `batch_norm_cpu_out`, I did the following:
* Let `batch_norm_cpu_transform_input_template` take another argument, `output`, and ask the call sites to pass in an output Tensor.
* Overload `batch_norm_cpu_update_stats_template` to take `save_mean` and `save_var`, and ask the call sites to pass in those Tensors.
* In `batch_norm_cpu_out`, pass `output`, `save_mean` and `save_var` all the way to our new `batch_norm_cpu_transform_input_template` and `batch_norm_cpu_update_stats_template`.
* In `batch_norm_cpu`, prepare for these outputs and call `batch_norm_cpu_out`.
Test Plan: Enable unit tests for `native_batch_norm.out`.
Differential Revision: D40992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88604
Approved by: https://github.com/iseeyuan, https://github.com/jjsjann123
Modify the lookup procedure for TorchDynamo caches to keep the most recently used cache entry at the head of the singly linked list, which may improve the probability of a cache hit.
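A generic move-to-front sketch in plain Python (the real cache is a C-level singly linked list of Dynamo cache entries; the names here are illustrative):
```python
class Entry:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def lookup(head, key):
    """Return (new_head, value); on a hit, splice the entry to the front."""
    prev, cur = None, head
    while cur is not None:
        if cur.key == key:
            if prev is not None:      # not already the head
                prev.next = cur.next  # unlink
                cur.next = head       # move to front
                head = cur
            return head, cur.value
        prev, cur = cur, cur.next
    return head, None

head = Entry("a", 1, Entry("b", 2, Entry("c", 3)))
head, value = lookup(head, "c")
print(head.key, value)  # c 3  (the hit entry is now the head)
```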
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88076
Approved by: https://github.com/jansel
This is a little tricky: there is a diag_embed.out, but it's not bound
in Python because it's autogenerated, see https://github.com/pytorch/pytorch/issues/88598
So I can't "just" add the out variant to the ref, as this makes it
inconsistent with the torch API. To workaround this, I mark the ref
as supporting out, but not the original function.
This is useful to do, because it means that diag_embed.out now supports
symbolic shapes. However, this cannot be easily tested because
I can't mark the out variant as being supported in the normal OpInfo test.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88671
Approved by: https://github.com/mruberry
- find_unused_parameters adds a slight overhead, but is required
in cases where users do not manually specify parameters to ignore
which will not receive grads. In some models, some parameters
do not receive grads, and this causes DDP to throw an exception
as it waits for a grad for each parameter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88645
Approved by: https://github.com/soumith
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro, which throws a generic `RuntimeError`. This change introduces a new error type, `DistBackendError`, which derives from `RuntimeError`, to signify that there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8
Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`
Sample script to demonstrate new error type
```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py
import torch
import torch.distributed as dist
if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```
Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
This build uses the wrong BUILD_ENVIRONMENT `pytorch-linux-focal-py3`, thus it hasn't been run for a long time (forgotten). The name was probably the build environment we used in the past; the convention today doesn't have the `pytorch-` prefix. There is a TODO for this:
> TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this.
This is done as part of [T131829540](https://www.internalfb.com/intern/tasks/?t=131829540), where we want
`static_runtime_benchmark` build and test jobs to run in OSS CI to avoid breaking internal
* I also fix some compiler warning errors `-Werror=sign-compare`, `-Werror,-Wunused-const-variable`, and a gcc7 compatibility issue along the way, because this hasn't been run for a long time.
* Reviving this test also reveals a small bug in `PrepackWeights` test in `test_static_runtime.cc` added recently in https://github.com/pytorch/pytorch/pull/85289. The test refers to an internal ops and should only be run internally. This has been fixed by https://github.com/pytorch/pytorch/pull/87799 (To be merged)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87660
Approved by: https://github.com/malfet
There is a fast path in InputBuffer to steal memory when the use count is zero; however, it is only used for sparse Tensors. According to Natalia, this is just because it wasn't obvious that there would be a benefit for dense Tensors, so there was no reason to live dangerously. However I've noticed large Tensors in internal models which would benefit from this optimization as well.
Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88339
Approved by: https://github.com/ngimel
Summary: Add an option to disable TORCH_WARN; some ops can trigger spammy TORCH_WARN logs, which is not desired under certain scenarios.
Test Plan:
Tested with `-pt.disable_warn = 1` and `-pt.disable_warn = 0`; verified that TORCH_WARN and TORCH_WARN_ONCE are properly handled.
Tested with `-pt.strip_error_messages = 1`, `-pt.disable_warn = 0`; verified that the stripped error message is respected when the warning is printed.
Differential Revision: D40321550
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87188
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
Along the way, I undid making sparse/dense dim symint (they're
dimensions, so they should be static.)
Also symintify set_indices_and_values_unsafe
There is a little bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. It is now populated with "empty" strides, and this meant that sparse tensors were falsely reporting they were non-overlapping dense/contiguous. I added in a hack to work around this case.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573
Approved by: https://github.com/anjali411
I'm not sure what would be the best behaviour here, but it feels a bit strange to perform parts of `float32` computations as `float64` and then downcast them back to `float32`.
Use `at::opmath_type` rather than `at:acc_type` as no accumulation is used in the op.
I don't know much about double vs single precision scalar perf on x86 CPU, but before the change:
```
python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())"
11.337517574429512
```
After the change:
```
$ python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())"
10.513805857859552
```
I.e. roughly a 7% perf improvement (measured on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz)
NOTE:
- `aten::acc_type<float, false>` yields `double`
- `aten::acc_type<float, true>` returns `float`.
Fixes https://github.com/pytorch/pytorch/issues/87968
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88277
Approved by: https://github.com/mingfeima, https://github.com/ngimel, https://github.com/jgong5
I accidentally delete my remote branch, so I need to create a new PR for this fix (instead of updating the reverted PR https://github.com/pytorch/pytorch/pull/88531)
TIL, `sudo echo` doesn't do what I think it does; the correct syntax should be `echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset`, granting sudo permission to the latter tee command.
### Testing
Did due diligence and actually logged in to `i-07e62045d15df3629` to make sure that the command works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88605
Approved by: https://github.com/ZainRizvi
Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary Python lists, and implements it in PyTorch core so no extensions are needed to construct NT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
The latest version 1.11.1 breaks PyTorch CI. A bunch of tests are failing now in master d1ee073041. Curiously, the latest commit 81042d3a53 looks green, but it's good to pin this dependency anyway
https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L95-L97 has a curious note about ninja and why it's not part of the docker container (need to revisit this later on):
```
#ninja
#Description: build system. Note that it install from
#here breaks things so it is commented out
```
This is one more reason to justify the effort of consolidating all pip and conda dependencies to get rid of this family of issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88548
Approved by: https://github.com/clee2000
Implement various features in https://github.com/pytorch/torchdynamo/issues/1644:
- Upload nightly run logs to /fsx before parsing - for backing up parsing failures.
- Flag models with (1) < 0.95x speedup, (2) > 2min compile time, (3) < 0.9x compression ratio
- Flag models that were passing yesterday but failed today.
- Other small bug fixes.
See https://github.com/pytorch/torchdynamo/issues/1831 for sample outputs.
Also tested by running run.sh:
```bash
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-3/
mkdir ../test-dynamo-runner-logs-3/
# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 --cold_start_latency
# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor --no-skip --dashboard --only mobilenet_v2
```
with the command
`python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-3/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard` (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88516
Approved by: https://github.com/anijain2305
I saw some missed optimization opportunities in C10 using std::move and thought I would submit a PR to fix them. There are particularly many of them dealing with the symbolic operators, which are used in quite a few places, including in loops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512
Approved by: https://github.com/ezyang
Summary:
* Added an error message for when the result is not a PassResult
* Modified the error handling to capture exceptions that happen in the check() function
* consolidated inplace_wrapper and pass_result_wrapper
Test Plan: CI
Differential Revision: D40950135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88520
Approved by: https://github.com/SherlockNoMad
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal and enables per-thread PGs.
It almost gets DDP working, but the PG is missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and have the default _World wrap it.
I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Differential Revision: D40236769
Pulled By: yhcharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
Summary:
D40798763 broke this op. Unfortunately, it wasn't caught at land time due to the recent OSS Static Runtime test problems.
The problem is C++ overload resolution. After D40798763, the int that we were passing to `at::native::tensor_split` was getting implicitly converted to `IntArrayRef`. Fix this by converting the int to a `SymInt` and calling the correct overload.
Test Plan:
```
buck2 test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Tensor_Split --run-disabled
```
Differential Revision: D40862394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88113
Approved by: https://github.com/hlu1
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as the ATen implementation.
- Step1: Parallelize and vectorize the reduction implementation
- Step2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated at step 1 to a single scalar
- Step3: Handle the tail elements
For the implementation, we create two kernels - `CppVecKernel` and `CppKernel`. The code block generation proceeds step by step as follows.
- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do some post-reduction things like store reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)
```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for the vectorization and non-vectorization kernel
    CppVecKernel.ReductionPrefix
    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop:
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores
    # Gen the code to reduce the vector to a scalar
    CppVecKernel.ReductionSuffix
    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop:
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores
    # The reduction is almost finished; do some post-reduction things like storing the reduction value
    CppKernel.ReductionSuffix
```
The following code snippet of a generated sum reduction exemplifies the idea. More detailed comments are inlined.
```C++
{
// Declare reduction for at::vec::Vectorized since it is not built-in data type.
#pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
float tmp4 = 0;
// tmp4_vec is used to vectorize the sum reduction for tmp4
auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
float tmp6 = 0;
// tmp6_vec is used to vectorize the sum reduction for tmp6
auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
#pragma omp parallel num_threads(48)
{
// Parallelize the vectorized reduction
#pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
for(long i0=0; i0<192; i0+=1)
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
auto tmp2 = tmp0 - tmp1;
auto tmp3 = tmp2.abs();
auto tmp5 = tmp2 * tmp2;
tmp4_vec += tmp3;
tmp6_vec += tmp5;
}
// Reduce the tmp4_vec as a scalar and store at tmp4
tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
// Reduce the tmp6_vec as a scalar and store at tmp6
tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
// Handle the tail elements that could not be vectorized by aten.
#pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
for(long i0=1536; i0<1536; i0+=1)
{
auto tmp0 = in_ptr0[i0];
auto tmp1 = in_ptr1[i0];
auto tmp2 = tmp0 - tmp1;
auto tmp3 = std::abs(tmp2);
auto tmp5 = tmp2 * tmp2;
tmp4 += tmp3;
tmp6 += tmp5;
}
}
out_ptr0[0] = tmp4;
out_ptr1[0] = tmp6;
}
```
Performance (measured by operatorbench; the baseline of the speedup ratio is the ATen operator performance):
Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) | Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134
HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal: 196708148 kB
MemFree: 89318532 kB
MemBandwidth: 112195.1MB/S
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res=torch.exp(torch.add(x, y))` as the example. The generated code is as follows if `config.cpp.simdlen` is 8.
```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
const float* __restrict__ in_ptr1,
float* __restrict__ out_ptr0,
const long ks0,
const long ks1)
{
#pragma omp parallel num_threads(48)
{
#pragma omp for
for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
auto tmp2 = tmp0 + tmp1;
auto tmp3 = tmp2.exp();
tmp3.store(out_ptr0 + 8*i0);
}
#pragma omp for simd simdlen(4)
for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
{
auto tmp0 = in_ptr0[i0];
auto tmp1 = in_ptr1[i0];
auto tmp2 = tmp0 + tmp1;
auto tmp3 = std::exp(tmp2);
out_ptr0[i0] = tmp3;
}
}
}
```
The major pipeline is as follows.
- Check whether the loop body could be vectorized by `aten::vec`. The checker consists of two parts. [One](bf66991fc4/torch/_inductor/codegen/cpp.py (L702)) is to check whether all the `ops` have been supported. The [other one](355326faa3/torch/_inductor/codegen/cpp.py (L672)) is to check whether the data access could be vectorized.
- [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and original omp simd kernel. Regarding the original omp simd kernel, it serves for the tail loop when the loop is vectorized.
- [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
- [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): The ops that we have supported on the top of `aten::vec`
- Create kernel
- [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
- [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
- [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
- [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
- [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))
Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
Speeds up autotuning a little bit more (about 90s -> 75s for coat_lite_mini)
@bertmaher, I've put in workaround so that internal doesn't break, but it can be removed once triton is updated internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88557
Approved by: https://github.com/anijain2305
Ref pytorch/torchdynamo#327
The use of as_strided does require in-memory manipulations; however, this
lowering allows those memory ops to be fused with any preceding calculations.
e.g.
```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```
Before, this compiles to two kernels plus a call to `aten.as_strided_scatter`;
with this PR it compiles to just two kernels and no additional operator calls.
In theory I think this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when this was
implemented as a decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379
Approved by: https://github.com/jansel
Summary:
Upstreaming this as part of sharing common APIs. This is just a plain
move, any changes needed to support DDP / FSDP will come in follow up diffs.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D40564646
fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539
Approved by: https://github.com/awgu
This also comes with some bug fixes that were uncovered from doing
this:
- Forward device calls to inner tensor in FunctionalTensorWrapper
- Make legacyExtractDispatchKey exclude Functionalize, so that
it can get at the real device type key. This is noncontroversial.
- Stop stripping dense from key set. The reason for this is
FunctionalWrapperTensor may be used in contexts where people
query if it is dense or not. If it doesn't report this correctly
(from the dispatch key), it will cause errors. This caused some
torchbench models to fail when I did one-pass tracing.
- Save and restore reapply views TLS correctly
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88063
Approved by: https://github.com/bdhirsh
- does not intend to support multi-process, as that is more complex
and we have torchbench scripts for that
- currently only works in accuracy mode as this was the main goal,
but could be extended for measuring single-gpu perf impact of
graph breaks
Run with
`python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp`
Example output
```
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
Summary:
This is a followup to https://github.com/pytorch/pytorch/pull/88354/files#diff-622913fdb49db90d6f3a8ab225b4badb7996023e6498e9f7c6d03fe9f32d0986R836
Reference to self.export got added to InstructionTranslatorBase (i.e. STORE_ATTR) but self.export is populated only for InstructionTranslators.
Here's an example failure
```
File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 322, in step
getattr(self, inst.opname)(inst)
File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 844, in STORE_ATTR
not self.export
AttributeError: 'InliningInstructionTranslator' object has no attribute 'export'
```
Let's populate the base class with the export flag.
Test Plan:
python test/dynamo/test_export_mutations.py
python test/dynamo/test_export.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88508
Approved by: https://github.com/tugsbayasgalan
Summary: Bypass "Runtime error: store to misaligned address [...] for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment"
Test Plan:
One of the failing tests, now passes
`buck test fbsource//arvr/mode/platform010/dev-asan fbsource//arvr/libraries/eye/engine:sys_test_eyetrackingenginevisioninterface`
Reviewed By: kimishpatel, salilsdesai
Differential Revision: D40918376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88276
Approved by: https://github.com/manuelcandales
Summary: This commit fixes the bug where `non_leaf_module_list`
was not respected for activation modules like `torch.nn.Sigmoid`
and `torch.nn.Tanh`. Today, these modules default to
`default_fixed_qparams_range_0to1_fake_quant`, and there is no
way to configure them to use any other activation_post_process
(e.g. FixedQParamsObserver) (see this [mapping](dc00bb51b8/torch/ao/quantization/quantization_mappings.py (L188-L193))).
`non_leaf_module_list` is a "list of non-leaf modules we want
to add observer" (see prepare docstring). If the user explicitly
specified to insert observers for these modules, we should respect
that instead of continuing to use the default.
Test Plan:
python test/test_quantization.py TestQuantizeEagerPTQStatic.test_activations_in_non_leaf_module_list
Reviewers: vkuzo, jerryzh168
Subscribers: vkuzo, jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88498
Approved by: https://github.com/jerryzh168
tbh at this point it might be easier to make a new workflow and copy the relevant jobs...
Changes:
* Disable cuda mem leak check except for on scheduled workflows
* Make pull and trunk run on a schedule which will run the memory leak check
* Periodic will always run the memory leak check -> periodic does not have parallelization anymore
* Concurrency check changed to be slightly more generous
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88373
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
During build, users commonly see a message like
```
fatal: no tag exactly matches 'd8b4f33324b1eb6c1103874764116fb68e0d0af4'
```
which is usually ignored when builds succeed, but has confused users when build fails (due to a different issue). This PR removes the red herring, since this usually prints for local development when tags are not found.
We catch the exception anyway and handle it under the hood, so we don't need to print it and confuse the user.
Test plan:
Note that builds on trunk currently have this line; cmd-F 'fatal: no tag exactly matches' in https://github.com/pytorch/pytorch/actions/runs/3379162092/jobs/5610355820.
Then check in the PR build to see that the line no longer appears.
I also tagged my commit locally and printed what the tag would be; this code and the old code printed the same result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88335
Approved by: https://github.com/seemethere
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Codegen changes include:
* codegen improvement:
i. allow non-root trivial reductions, allow empty/no-op fusion
ii. fixes vectorization checks and size calculation
iii. bank conflict handle improvement
iv. enables transpose scheduler
* misc:
i. CI tests failure fixes
ii. cpp tests file clean up
iii. trivial forwarding supports added in codegen runtime
iv. added factory methods support in codegen
Commits that are in this PR from the devel branch:
```
7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048)
65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023)
6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047)
f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032)
d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901)
d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040)
0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037)
a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035)
e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025)
197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034)
40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031)
bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029)
ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030)
fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024)
bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006)
e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995)
1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. (#1991)
a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010)
35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016)
0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007)
45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005)
967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990)
a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004)
dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003)
3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000)
a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999)
fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998)
482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997)
4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970)
42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996)
fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)
15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)
8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994)
1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992)
f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986)
eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)
634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981)
eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954)
e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882)
425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835)
306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969)
b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967)
bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927)
b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959)
d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928)
c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965)
c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962)
93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960)
45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)
a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946)
f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955)
df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957)
7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953)
357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956)
8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952)
90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940)
ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951)
992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949)
208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from (#1941)
ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828
631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)
aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD
4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942)
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
```
RUN_TORCHBENCH: nvfuser
Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779
Approved by: https://github.com/davidberard98
Summary:
Improved the roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA Caching Allocator.
This new version allows setting the number of divisions per power-of-two interval, starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations, but there are also very large, persistent allocations that would not benefit from rounding and would only take up extra space.
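For illustration, a hedged sketch of how such a setting might be passed through `PYTORCH_CUDA_ALLOC_CONF`; the bracketed per-interval syntax below is an assumption, not something stated in this PR:
```python
# Hedged sketch: configure per-interval rounding before CUDA is initialized.
# The bracketed interval syntax is an assumption, not taken from this PR.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"

import torch  # the caching allocator reads the setting on first CUDA allocation
x = torch.empty(1_000_000, device="cuda")
```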
Test Plan: Tested locally
Differential Revision: D40103909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
The bug was that I was accidentally caching at the wrong key name, so
we were never actually hitting the cache. I've renamed the resolved
key to final_key to avoid shadowing in this way.
This reverts commit 410ce96a23a3496a45478e0b25ffac53aa3c116f.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88489
Approved by: https://github.com/albanD
Summary: When using torch deploy, if we do fx transformation and then try to pickle/unpickle a fx GraphModule, it's possible that the GraphModule's code depends on `builtins` but we didn't add it to extern module.
Reviewed By: PaliC
Differential Revision: D40958730
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88385
Approved by: https://github.com/PaliC
This improves the memory compression of resnet18 from .84 -> .94 on inductor no-cudagraphs. It does mean that any extern kernel which incorrectly computes strides will be a hard error at runtime, but that's an issue we are going to have to face with dynamic shapes anyway. CC @ezyang, @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88248
Approved by: https://github.com/ezyang
Summary:
User can't call `.unpack()` when they have a quantized Embedding layer because `&EmbeddingPackedParamsBase::unpack` was never exposed to Python through pybind.
This diff fixes that.
Test Plan: CI
Reviewed By: jerryzh168
Differential Revision: D40606585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88362
Approved by: https://github.com/jerryzh168
Summary:
# Initialize Kineto Profiler for on-demand profiling
## TLDR
Overall this patch enables initializing the kineto profiling library on start-up. This is guarded by an env variable that is described a bit more later. The kineto profiler is otherwise initialized lazily when pytorch profiler is invoked.
## Background
We are enabling on-demand profiling capability for pytorch. As users run large distributed training flows this will enable one to capture a pytorch profiler/GPU trace remotely, from outside the process. The kineto library and a monitoring daemon - dynolog- interact to achieve this.
Dynolog will be open sourced by end of October, and has been dogfooded on Meta AI Research cluster.
https://github.com/facebookincubator/dynolog
### How it works
Kineto library registers itself with the dynolog daemon running on the host over inter process communication
```
| kineto | --> (ipcfabric) --> | dynolog |
* register()
* poll for on-demand tracing configs()
```
This feature is currently enabled by setting the env variable `KINETO_USE_DAEMON`. However, it only works if we initialize kineto, else the thread to talk to dynolog is not spun up.
Related PRs in kineto include
https://github.com/pytorch/kineto/pull/637 and https://github.com/pytorch/kineto/pull/653
## Test Plan:
Build pytorch from source (need to set USE_LITE_INTERPRETER_PROFILER=OFF)
Run a simple linear model [example](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html).
### First run with the env variable set
```
export KINETO_CONFIG=/private/home/bcoutinho//libkineto.conf
export KINETO_USE_DAEMON=1
python3 /private/home/bcoutinho/linear_model.py
```
Output
```
INFO:2022-10-18 09:01:12 4169946:4169946 init.cpp:98] Registering daemon config loader
cuda:0
```
We can trigger a trace using the dynolog client tool
```
#> dyno gputrace --log-file /tmp/gpu_trace_test.json
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[4116844],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[4116844]}
Matched 1 processes
Trace output files will be written to:
/tmp/gpu_trace_test_4116844.json
```
### Run without env variable.
```
python3 ../../linear_model.py
cuda:0
99 1425.056884765625
10099 8.817168235778809
```
## Side effects to initialization
Currently the environment should guard users from picking this change up unless intended. The libkineto_init does setup CUPTI APIs and spins up a thread to read on-demand configurations. This should not be problematic, we can provide a more granular init in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87226
Reviewed By: chaekit
Differential Revision: D40558184
Pulled By: briancoutinho
fbshipit-source-id: afea7502b1d72201c00994c87fde63a35783f4d5
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88020
Approved by: https://github.com/chaekit
Summary:
https://www.internalfb.com/code/fbsource/[c0e4da0b5c7fff3b4e31e4611033c30cabdc6aef]/fbcode/caffe2/torch/csrc/jit/backends/backend_detail.cpp?lines=268-276
It seems that the TorchScript addition of
`$unpack, = self.__backend.execute( ... `
means the comma after `unpack` forces the result of `execute` to have exactly one item. So with this fix, when the number of outputs is > 1, `execute` returns a list containing a single list of outputs (basically, the outputs are put into another list before being placed into the list we return):
```
[[output1, output2, output3, ...]]
```
instead of
```
[output1, output2, output3, ...]
```
Do we want to fix this in backend_detail? Or should we make the change in our delegate to accommodate the TorchScript? Proposing this question here. Requesting cccclai, kimishpatel for approval here.
Test Plan: unblocked models for chengxiangyin and models in pytorch playground all passing unit tests
Reviewed By: kimishpatel, cccclai
Differential Revision: D40328684
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88345
Approved by: https://github.com/jmdetloff, https://github.com/Skylion007
We would like to be able to parameterize kernels such that a parameterized
algorithm can be implemented via templates. We can then profile performance of
a kernel with different parameter values. This enables us to determine what
parameters may work the best for a given kernel or a given device.
In this diff, one such kernel is added for 1x1 conv, parameterized across the size of
the tile being produced by each invocation.
Few other options for parameters can be:
- One can imagine dtype can also be a parameter such that we can do compute in
fp16 or int8/int16.
- Register blocking for input channels
Differential Revision: [D40280336](https://our.internmc.facebook.com/intern/diff/D40280336/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40280336/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88323
Approved by: https://github.com/jmdetloff
Summary:
See https://github.com/pytorch/torchdynamo/issues/1475
Not allowing any new mutations happen inside forward() function during
export.
Test Plan:
Run `python test/dynamo/test_export.py` and make sure it passes
Added new unit tests (3 positive tests and 4 negative tests)
Here's what the actual error looks like
```
File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 322, in step
getattr(self, inst.opname)(inst)
File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 835, in STORE_ATTR
assert not self.export, f"Mutating module attribute {inst.argval} during export."
AssertionError: Mutating module attribute a during export.
from user code:
File "/data/users/mnachin/pytorch/test/dynamo/test_export_mutations.py", line 25, in forward
self.a = self.a.to(torch.float64)
Set torch._dynamo.config.verbose=True for more information
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88354
Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel
This reverts commit 1c8a0656d65412b83d3c00f2fc66ab958e991de8.
Reverted https://github.com/pytorch/pytorch/pull/88175 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks cuda 11.6 in trunk. As the PR signal was green, this is probably a landrace
In cases where a tensor kwarg is actually `out=`, we should produce a nicer error message than this:
```
Traceback (most recent call last):
File "/fsx/users/binbao/pytorch/torch/_inductor/graph.py", line 241, in call_function
out = lowerings[target](*args, **kwargs)
File "/fsx/users/binbao/pytorch/torch/_inductor/lowering.py", line 168, in wrapped
assert not any(isinstance(x, TensorBox) for x in kwargs.values())
AssertionError
```
https://github.com/pytorch/torchdynamo/issues/1798
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88367
Approved by: https://github.com/desertfire
See strategy at PythonOpRegistrationTrampoline.cpp for the
big picture.
Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low level python_dispatch impl API
to disallow empty strings for dispatch key, which had the knock
on effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd if we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)
Test strategy is we delete the logic for preventing Python op
registrations in torch from being skipped in a torchdeploy context
and show CI still works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
The motivation is that I am going to add the ability to temporarily
install entries to the python dispatcher, and to do that, I need
an easier way to clear the cache. Putting the cache in a dict
centralizes cache clearing in one place. I then add some easy
cache clearing.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88329
Approved by: https://github.com/albanD
983c0e7f31/torch/optim/adam.py (L163)
The above line requires that a candidate optimizer state dict being loaded via `load_state_dict()` has non-empty state for its 0th parameter (via `state_values[0]`). This PR changes FSDP to only include non-empty mappings in the state returned by `_flatten_optim_state_dict()`, which is the subroutine for both `shard_full_optim_state_dict()` and `flatten_sharded_optim_state_dict()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88353
Approved by: https://github.com/fegin
After internal discussion, we are currently preferring `fully_shard()` as the name of the composable FSDP API.
- `FullyShardedDataParallel` (FSDP) has existing brand value, so the chosen name should try to preserve that. We think this takes precedence over the fact that composable FSDP may encompass more than just the ZeRO-3 approach of _fully sharding_.
- Given the refactoring efforts, it would also not be challenging to create a new frontend API like `hybrid_shard()` that calls into the same underlying initialization and runtime except for a different `ShardingStrategy`. In other words, we do not have to coalesce all sharding strategies under `fully_shard()`.
- The other composable APIs are verbs (`replicate()`, `checkpoint()`), so the chosen name should be a verb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88233
Approved by: https://github.com/mrshenli
As part of the ongoing LTC migration effort, PyTorch/XLA is updating its codegen to use `xla::Shape` instead of `torch::lazy::Shape`. To achieve this, this PR updates the codegen to make the `GenLazyNativeFuncDefinition` generator customizable.
The existing `GenLazyNativeFuncDefinition` is kept by using the initial default values, so this change should not introduce any new behaviors to the existing codegen in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87823
Approved by: https://github.com/alanwaketan, https://github.com/wconstab
Fixes https://github.com/pytorch/pytorch/issues/84365 and more
This PR addresses not only the issue above, but the entire family of issues related to `torch._C.Value.type()` parsing when `scalarType()` or `dtype()` is not available.
This issue existed before `JitScalarType` was introduced, but the new implementation reintroduced the bug because the new APIs `from_name` and `from_dtype` require parsing `torch._C.Value.type()` to get proper inputs, which is exactly the root cause of this family of bugs.
Therefore `from_name` and `from_dtype` must be called when the implementor knows the `name` and `dtype` without parsing a `torch._C.Value`. To handle the corner cases hidden within `torch._C.Value`, a new `from_value` API was introduced and it should be used in favor of the former ones for most cases. The new API is safer and doesn't require type parsing from user, triggering JIT asserts in the core of pytorch.
Although CI is passing for all tests, please carefully review all symbolics/helpers refactoring to make sure the meaning/intention of the old calls is not changed in the new calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87245
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
I am one of the maintainers of pybind11, and a frequent PyTorch user. We added quite a lot of bugfixes and performance improvements in 2.10.1 (see the changelog for full details) and I wanted to upstream them to PyTorch.
Our releases are tested throughout Google's codebase, including on their global builds of PyTorch, so there should be no surprises.
The main new feature is opt-in Eigen Tensor to NumPy casters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88332
Approved by: https://github.com/soumith
Summary:
Sometimes we want to extend an existing custom namespace library, instead of creating a new one,
but we don't have a namespace config right now, so we hardcode some custom libraries defined
in pytorch today, i.e. quantized and quantized_decomposed
Test Plan:
ci
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88229
Approved by: https://github.com/ezyang
Fixes: https://github.com/pytorch/pytorch/issues/88010
This PR does a couple things to stop slow gradcheck from timing out:
- Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?)
- Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck
- because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent them from being put together. We can undo the hack after we see actual test times are updated. ("def calculate_shards" randomly divides tests with unknown test times in a round-robin fashion.)
- Updates references to test_ops_gradients and TestGradients
- Test files that are skipped for slow gradcheck CI are now centrally located in run_tests.py; this reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. for test_mps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216
Approved by: https://github.com/albanD
Fixes #83038
Currently _compare_ort_pytorch_outputs does not produce clear error messages for differences in the zero point or scale of the two outputs. It also does not produce a clear error message for whether both are quantized.
This pull request adds assertions to output whether the scales and zero points have differences, and whether each individual output is quantized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87242
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
Summary: Fix the typing stub of `ProcessGroup` in "torch/distributed/__init__.py", so that it won't confuse pyre, and we can remove a lot of pyre suppression comments.
Test Plan: pyre check
Differential Revision: D40921667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88281
Approved by: https://github.com/wanchaol
This PR updates the reviewers responsible for CPU related modules: "IDEEP", "oneDNN graph", "CPU ATen backend", "CPU frontend" and "Autocast". It also adds "NNC" and adds the corresponding reviewers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87591
Approved by: https://github.com/malfet
Previously the permute function was extended to behave like the `order`
function for first-class dimensions. However, unlike `permute`,
`order` doesn't have a keyword argument `dims`, and there is no way to add
it in a way that lets both `permute` and `order` continue to have the same
behavior. So this change just removes the extra functionality of permute,
which wasn't documented anyway. Fixes #88187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88226
Approved by: https://github.com/zou3519
Fixes: https://github.com/pytorch/pytorch/issues/88205
The `CreationMeta::NO_GRAD_MODE` path in handle_view_on_rebase wrongly assumes that the tensor would be a leaf, because tensors created in no_grad are always leaf tensors. However, due to creation_meta propagation, a view of a view created in no_grad also has `CreationMeta::NO_GRAD_MODE`, but DOES have grad_fn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88243
Approved by: https://github.com/albanD
After conda, consolidating all macos pip dependencies to cache every dependencies that macos CI needs. Two small issues are found along the way in `_mac-test-mps` workflow:
* It didn't have `Install macOS homebrew dependencies` to install libomp like the regular `_mac-test` workflow
* It didn't install `scipy`, thus silently skipping some `signal.windows` tests
Both are fixed in this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88071
Approved by: https://github.com/malfet
Fixes https://github.com/pytorch/torchdynamo/issues/1708
Our FX subgraph partitioner works by taking all of the original output nodes from a subgraph, and replacing it with a new `call_module` node in the graph.
If the original subgraph outputs had fake tensors and other metadata stored in their `.meta` attribute though, then this information was getting lost when we spliced in the subgraph.
Losing metadata on an FX graph also seems like an easy trap to fall into, so I'm wondering if there are any better guardrails that we can add. I ended up fixing in this PR by adding an optional kwarg to propagate meta info directly in the `fx.Node.replace_all_uses_with`, just because propagating metadata seems like a pretty core thing.
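As a hedged illustration of the kwarg described above (the `propagate_meta` name is taken from this PR; the toy graph below is purely illustrative):
```python
import torch
import torch.fx as fx

def f(x):
    return x + 1

gm = fx.symbolic_trace(f)
old = next(n for n in gm.graph.nodes if n.op == "call_function")
old.meta["note"] = "kept"                      # metadata we do not want to lose

with gm.graph.inserting_after(old):
    new = gm.graph.call_function(torch.add, args=old.args)
old.replace_all_uses_with(new, propagate_meta=True)  # copies old.meta onto new
gm.graph.erase_node(old)
gm.recompile()
```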
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87255
Approved by: https://github.com/wconstab, https://github.com/SherlockNoMad
* Add support to use Linux kernel perf subsystem via the profiler.
* For now the perf configurability is quite limited to just event names. Threading etc. to come later.
* Given we want to support variety of different cpu types, number of events list (in addition to the standard set of events) is also limited.
* Rather than failing with unsupported feature for non-Linux platforms, it returns zeros for all the event counts.
* For now, max event counts is capped at 4, time multiplexing is not allowed.
* Threadpool recreate hack is restricted to mobile only - need to add better support for threading in general
Differential Revision: [D40238033](https://our.internmc.facebook.com/intern/diff/D40238033/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238033/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87866
Approved by: https://github.com/SS-JIA
**`_init_param_attributes()` -> `init_flat_param_attributes()`**
We move `_init_param_attributes()` to `FlatParamHandle.init_flat_param_attributes()` (as already marked as to-do during previous refactoring).
**`_reset_lazy_init()`**
We no longer delete `_local_shard` from each `FlatParameter` in `_reset_lazy_init()`.
**Analysis**
Thus, the two semantic differences are that we remove the initial `if hasattr(p, "_local_shard")` early return in `_init_param_attributes()` and the `delattr(p, "_local_shard")` in `_reset_lazy_init()`.
This is safe because
- If we never call `_reset_lazy_init()`, then `init_flat_param_attributes()` is only called once. There is no opportunity for an early return.
- If we call `_reset_lazy_init()`, then `init_flat_param_attributes()` will be called again in the next `_lazy_init()`. However, since we removed the early return, all of the attributes initialized in `init_flat_param_attributes()` simply get re-initialized and override any existing attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87938
Approved by: https://github.com/mrshenli
This PR refactors and fixes `_cast_buffers()`.
**Before**
Buffers were not correctly cast back to their original dtypes for submodules when using buffer mixed precision.
- `_cast_buffers(recurse=False)` incorrectly casts all buffers, including those in submodules. This is because of this outer loop over `self.modules()`:
c40033be16/torch/distributed/fsdp/fully_sharded_data_parallel.py (L700)
- There was a unit test that checked that buffers were cast as expected (`test_mixed_precision_e2e_full_shard()`). The unit test _coincidentally_ passed because all modules shared the same buffer name `"buffer"`. In `_cast_buffers()`, the `dict` mapping buffer name to original dtype is populated lazily (during `_lazy_init()`). However, the keys are unprefixed:
c40033be16/torch/distributed/fsdp/fully_sharded_data_parallel.py (L712-L717)
- Thus, even though (1) `_cast_buffers(recurse=False)` was only called on the root and (2) `self._buffer_name_to_orig_dtype` had unprefixed names as keys, the unit test still passed because (1) `_cast_buffers()` still looped over all buffers despite `recurse=False` and (2) all submodules' buffers were named `"buffer"` and had the same original and low-precision dtypes and hence were cast correctly.
If we change each submodule to have its own distinct buffer name, then the unit test fails. This PR makes such a change to showcase the progression granted by this PR.
**After**
This PR separates `_cast_buffers()` into three methods: `_get_buffers_and_dtypes_for_computation()`, `_get_buffers_and_dtypes_for_checkpoint()`, and `_cast_buffers_to_dtype_and_device()`. This is to separate the different use cases (casting for computation and casting for checkpointing) and the corresponding code paths. Plus, the signature for `_cast_buffers_to_dtype_and_device()` makes it clear exactly what buffers are being cast and to what dtype.
Both `_get_...()` functions assume that they are called on the root only for now. This coincides with the construction of `_buffer_name_to_orig_dtype` in the FSDP constructor, which loops over all submodules. (This means that for non-root modules, their `_buffer_name_to_orig_dtype` is populated but not used.) The `dict`'s keys are clean since the buffer cast to original dtype happens in a `summon_full_params()` context, which cleans the names.
**Follow-Ups**
- We can try to move `_get_buffers_and_dtypes_for_checkpoint()` into `_state_dict_utils.py` in a follow-up.
- We may want to move to per-module buffer casting (i.e. do not have the root module cast for all submodules).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87935
Approved by: https://github.com/mrshenli
This PR moves `_fsdp_root_pre_forward()` to `_runtime_utils.py`.
Note: This PR includes a (temporary) fix for `NO_SHARD` + `CPUOffload(offload_params=True)`, where we set `non_blocking=False` when copying the gradient from device to host. It is only included in this PR since the test was **flaky** (but not consistently failing) on this PR , so I needed to fix to unblock land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87930
Approved by: https://github.com/mrshenli
Triton master no longer requires a `d` or `f` suffix
on some libdevice function calls; it dispatches to the right
library call based on argument type.
triton pin updated to
f16138d447
Also removed some xfails for some unrelated tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88242
Approved by: https://github.com/ngimel
This test by itself isn't the end goal, but it is a minimal test that exercises multi-gpu and the focus of the PR is the infra behind enabling that. I'll follow up with more tests using actual models etc.
and @malfet @desertfire for awareness/feedback on the infra side
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87996
Approved by: https://github.com/aazzolini
The flash attention code path requires sm80 or newer to run on
BFloat16, so any OpInfo tests running with BFloat16 would fail with
the error:
```
RuntimeError: Expected q_dtype == at::kHalf || (is_sm8x && q_dtype == at::kBFloat16) to be true, but got false.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86600
Approved by: https://github.com/ngimel
Re-submit of gh-72302
This still has a small performance hit, but it much smaller. On my
machine I see `_record_function_exit._RecordFunction` takes 1.05 us
compared to the `Tensor` overload taking 0.79 us.
In an overall comparison, I see a 0.7 us slowdown from 6.0 us to
6.7 us for this timeit benchmark
```python
import torch
def foo():
    with torch.profiler.record_function("foo"):
        return torch.eye(3)
%timeit foo()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76420
Approved by: https://github.com/robieta
This line is added by autoCCBot, but is not really meaningful as commit
message
Test Plan:
```
>>> from trymerge import GitHubPR, RE_PR_CC_LINE
>>> import re
>>> pr=GitHubPR("pytorch", "pytorch", 87809)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'Fixes #ISSUE_NUMBER\r\n\n\n'
>>> pr=GitHubPR("pytorch", "pytorch", 87913)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'Parallel compilation warms the Threadpool when we call `torch._dynamo.optimize()`. In current benchmarks, we were setting up the TRITON_CACHE_DIR much later. Because of this parallel compilation artifacts were not used and compilation latency improvements were not visible in dashboard. This PR just prepones the setup of TRITON_CACHE_DIR.\n\n'
>>> pr=GitHubPR("pytorch", "pytorch", 85692)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'This PR sets CUDA_MODULE_LOADING if it\'s not set by the user. By default, it sets it to "LAZY".\r\n\r\nIt was tested using the following commands:\r\n```\r\npython -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows a memory usage of: 287,047,680 bytes\r\n\r\nvs\r\n\r\n```\r\nCUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows 666,632,192 bytes. \r\n\r\nC++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality).\r\n\r\n'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88252
Approved by: https://github.com/xuzhao9, https://github.com/izaitsevfb
Summary:
att, this is experimental api so not marking it as bc-breaking.
The match will be accepted only if all the filters in the list passes.
Changing the filter arg to be list also allows us to pass in empty list that means no filter, which makes user code cleaner.
Test Plan:
python test/test_fx.py -k test_replace_pattern_with_filters
Differential Revision: [D40810943](https://our.internmc.facebook.com/intern/diff/D40810943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87998
Approved by: https://github.com/SherlockNoMad
This improves hf_Bert 1.139x->1.21x, currently lowmem dropout doesn't work for nn.Dropout module, and before this change we were recomputing all the dropout masks in a very inefficient kernel. This change pushes dropout masks to be saved in the dropout kernels where they are first computed.
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88046
Approved by: https://github.com/Chillee
- FSDP tests require nccl
- also run in inductor shard and skip inductor in distributed shard
- inductor shard has newer GPU and supports triton/inductor, but only runs on trunk
- distributed shard runs on PR, but inductor shard only runs on trunk/opt-in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88133
Approved by: https://github.com/davidberard98
This PR introduces the composable FSDP API (with constructor semantics only) along with some further constructor refactoring. A notable contribution here is `_get_submodule_to_states()`, which performs auto wrapping without actually wrapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87924
Approved by: https://github.com/mrshenli
This PR makes a second pass over the constructor. The logic has been grouped into `_init_<...>` functions based on intent (e.g. `_init_prefetching_state()` or `_init_runtime_state()`). This makes the initialization code for composable FSDP much cleaner than having to re-write the same sequences of lower-level helper calls.
This PR also moves `_ExecOrderData` into its own file `_exec_order_utils.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87923
Approved by: https://github.com/mrshenli
Summary: `ReplaceWithMaybeCopy` is guarded by `FBCODE_CAFFE` in `OptimizeGraph`. Run the pass manually to ensure it does the replacement.
Test Plan: Existing tests
Differential Revision: D40858743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88099
Approved by: https://github.com/huydhn
Fixes #87713
BMM for cpu supports non-contiguous nested tensor inputs, while BMM for Cuda does not support currently non-contiguous inputs.
The derivative for BMM:
```
- name: bmm(Tensor self, Tensor mat2) -> Tensor
  self: grad.bmm(mat2.transpose(1, 2).conj())
  mat2: self.transpose(1, 2).conj().bmm(grad)
  result: self_t.bmm(mat2_p) + self_p.bmm(mat2_t)
```
When calling backward it was impossible for this function to succeed, since the inputs to the backward formula were always non-contiguous regardless of the user input. This PR adds contiguous calls to the CUDA BMM implementation for nested tensors.
This was not caught by tests because grad_check is currently only done on CPU in test_nestedtensors. This PR updates the autograd test to also be run on GPU.
As a result I found one more issue with the backward for to_padded_tensor erroring instead of calling the generic version.
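As a hedged repro sketch of the failing pattern (requires a CUDA device; the shapes and the final reduction via `to_padded_tensor` are illustrative, not taken from the PR):
```python
import torch

a = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], device="cuda", requires_grad=True)
b = torch.nested.nested_tensor(
    [torch.randn(3, 5), torch.randn(3, 5)], device="cuda", requires_grad=True)
out = torch.bmm(a, b)
# The backward formula bmm's transposed (hence non-contiguous) operands, which
# previously errored on CUDA without the added contiguous() calls.
torch.nested.to_padded_tensor(out, 0.0).sum().backward()
```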
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88108
Approved by: https://github.com/cpuhrsch
We introduce the serializer created in the previous diff into our XNNGraph builder; the purpose of this is to serialize parts of the graph as we build it. At the end, we are able to finish and serialize the XNN graph into a std::string for use when we forward this along to the on-device runtime.
The next diff will rebuild the xnngraph from the serialization we introduce here, so testing the serialization of the graph will be done in the next diff
Differential Revision: [D39335580](https://our.internmc.facebook.com/intern/diff/D39335580/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39335580/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87908
Approved by: https://github.com/digantdesai
At this point we perform conversion from TorchScript IR to an XNNPACK graph. Currently we only support converting Add nodes and fp32 tensor values.
As a caveat, we are not building this at runtime. So for testing we just run the XNN graph once ahead of time with sample inputs and forward the result to execute. This is only for testing and will be changed in a later diff. This will allow us to check that graph creation is sound.
Differential Revision: [D39838851](https://our.internmc.facebook.com/intern/diff/D39838851/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87824
Approved by: https://github.com/digantdesai, https://github.com/salilsdesai
Avoid a double exception in the destructor when attempting to serialize to a
Python object that does not have a `write` method.
Use the `Finalizer` class in `PyTorchStreamWriter::writeEndOfFile()` to
always set the `finalized_` property even if an exception occurs (as there
isn't much one can do at this point).
Add an explicit check for the attribute to `_open_zipfile_writer_buffer` and
add unit tests.
Modernize the code a bit by using the Python 3 `super()` method.
Fixes https://github.com/pytorch/pytorch/issues/87997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88128
Approved by: https://github.com/albanD
As per #87979, `custom_bwd` seems to forcefully use `torch.float16` for `torch.autograd.Function.backward` regardless of the `dtype` used in the forward.
Changes:
- store the `dtype` in `args[0]`
- update tests to confirm the dtype of intermediate result tensors that are outputs of autocast compatible `torch` functions
cc @ptrblck @ngimel
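For context, a usage sketch of the decorators involved (standard `torch.cuda.amp` API; the `Function` itself is illustrative and not taken from the PR):
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMatmul(torch.autograd.Function):
    @staticmethod
    @custom_fwd                 # runs under the surrounding autocast dtype
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a @ b

    @staticmethod
    @custom_bwd                 # with this change, reuses the forward dtype rather than float16
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad @ b.transpose(-1, -2), a.transpose(-1, -2) @ grad
```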
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029
Approved by: https://github.com/ngimel
Adds `/FS` option to `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS`.
So far I've encountered this kind of errors:
```
C:\Users\MyUser\AppData\Local\Temp\tmpxft_00004728_00000000-7_cuda.cudafe1.cpp: fatal error C1041: cannot open program database 'C:\Projects\pytorch\build\third_party\gloo\gloo\CMakeFiles\gloo_cuda.dir\vc140.pdb'; if multiple CL.EXE write to the same .PDB file, please use /FS
```
when building with VS 2022.
cc @peterjc123 @mszhanyi @skyline75489 @nbcsm
Related issues:
- https://github.com/pytorch/pytorch/issues/87691
- https://github.com/pytorch/pytorch/issues/39989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88084
Approved by: https://github.com/ezyang
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674
## Description
Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error and so was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR reverts that restriction, enabling `src_mask` on the fast path (see the sketch after the list below):
- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along a dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often that is used.
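For illustration, a minimal sketch of the now-enabled path with a 2D attention mask (module config and sizes are illustrative):
```python
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
layer.eval()
x = torch.randn(2, 5, 16)                        # (B, L, E)
src_mask = torch.zeros(5, 5, dtype=torch.bool)   # (L, L) attention mask
with torch.inference_mode():
    out = layer(x, src_mask=src_mask)            # previously disabled on the CPU fast path
```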
## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match
## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`
Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
This reverts commit e3e84830aade59722d819bc5fa01922239494790.
Reverted https://github.com/pytorch/pytorch/pull/87292 on behalf of https://github.com/weiwangmeta due to breaking internal test relating to quantization eager tests, see test/quantization/eager/test_quantize_eager_ptq.py test_lower_graph_linear and test_lower_graph_conv2d
XPU would like to support the channels-last memory format for the group norm operator; however, PyTorch currently converts all input tensors to contiguous format, including channels-last tensors. PyTorch needs to pass this memory format hint down to the backend.
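For illustration, the memory-format hint in question (a channels-last input to `GroupNorm`, which currently gets made contiguous before dispatch):
```python
import torch

x = torch.randn(8, 32, 16, 16).to(memory_format=torch.channels_last)
gn = torch.nn.GroupNorm(num_groups=4, num_channels=32)
out = gn(x)  # the channels-last hint on x is currently lost before reaching the backend
```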
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87680
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/torchdynamo/issues/1802
There are a few problems,
1. torch.fused_moving_avg_obs_fake_quant doesn't have OpInfo test
2. self.empty_like() is not a valid call; it should be torch.empty_like(self)
3. python meta function has some unexplained behavior for arguments with default value of bool type?
In particular, problem 3 is the most concerning one.
**UPDATE: This is expected behavior, see discussion below for explanation.**
Without setting the default value for `per_row_fake_quant` and `symmetric_quant`, it gets the following error when running with meta tensor.
```
meta__fused_moving_avg_obs_fq_helper() missing 2 required positional arguments: 'per_row_fake_quant' and 'symmetric_quant'
```
I can fix this by adding the default values to these two args. However, I observed something strange when examining the actual values in the meta function.
```
print("per_row_fake_quant", per_row_fake_quant)
print("symmetric_quant", symmetric_quant)
```
When the default values are False, the printed values correctly reflect the arg values populated from the call site.
When the default values are True, the printed values are ALWAYS True, regardless of the values populated from the call site.
When the default values are None, the printed value is `None` when the call site sets the value to `False`, and `True` when the call site sets the value to `True`.
I also verified that this bug affects other meta functions with default args.
My speculation is that this is something about pybind value packing when called from c++ dispatcher to python meta function, and default value parsing for python meta function (and other python dispatch functions) ?
I tried to find the c++ call stack, but gdb is missing symbols and C++ stacktrace is not working properly... Appreciate anyone who can point me to the source file for pybind value packing.
cc @ezyang
cc @bdhirsh. I know you had a fix in the symbolic shape branch...
cc @yanboliang who reported this bug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88058
Approved by: https://github.com/bdhirsh, https://github.com/yanboliang
unfold_backward has a dedicated kernel for `stride >= size` which uses temporary
tensors created by `at::arange` to perform the mapping from unfolded to folded.
This instead uses `unfold` to view the output, and does a direct copy from the
gradient into the view.
In benchmarks I see either no difference or a marginal speed benefit from
this PR.
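A minimal Python sketch of the idea (not the actual ATen kernel, and assuming non-overlapping windows, i.e. step >= size):
```python
import torch

def unfold_backward_non_overlapping(grad, input_sizes, dim, size, step):
    # grad has the same shape as input.unfold(dim, size, step), so viewing the
    # zero-filled grad_input with unfold lets us copy the gradient in directly.
    assert step >= size, "windows must not overlap for this simple copy"
    grad_input = grad.new_zeros(input_sizes)
    grad_input.unfold(dim, size, step).copy_(grad)
    return grad_input

g = torch.ones(3, 2)  # gradient of torch.arange(10.).unfold(0, 2, 3)
print(unfold_backward_non_overlapping(g, [10], 0, 2, 3))
```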
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88061
Approved by: https://github.com/albanD
`unfold_backward` implements the same operation as `col2im` but without support
for 2d kernels or dilation. However, `istft` doesn't use any of those features
and `unfold_backward` actually has a faster `TensorIterator` based
implementation so we should use it here instead.
In the example from #87353 I see a 2x speedup on both CPU and CUDA.
On a wider variety of sizes and inputs I still see speedups across the board, especially
on CPU since `col2im` isn't parallelized but `unfold_backward` is:
| device | shape | hop_length | Master (us) | This PR (us) | Speedup |
|--------|-----------------|------------|-------------|--------------|---------|
| CUDA | (1, 129, 33) | 256 | 147 | 136 | 1.08 |
| | | 128 | 153 | 128 | 1.20 |
| | (100, 129, 20) | 256 | 181 | 147 | 1.23 |
| | | 128 | 171 | 137 | 1.25 |
| | (1000, 129, 10) | 256 | 681 | 443 | 1.55 |
| | | 128 | 632 | 446 | 1.42 |
| CPU | (1, 129, 33) | 256 | 106 | 104 | 1.02 |
| | | 128 | 103 | 81 | 1.27 |
| | (100, 129, 20) | 256 | 2400 | 399 | 6.02 |
| | | 128 | 2150 | 313 | 6.87 |
| | (1000, 129, 10) | 256 | 13800 | 3740 | 3.69 |
| | | 128 | 12700 | 2110 | 6.02 |
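For reference, a small sketch of the istft call path exercised in the table above (sizes are illustrative):
```python
import torch

x = torch.randn(100, 4096)
window = torch.hann_window(256)
spec = torch.stft(x, n_fft=256, hop_length=128, window=window, return_complex=True)
y = torch.istft(spec, n_fft=256, hop_length=128, window=window)  # overlap-add step sped up here
```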
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88060
Approved by: https://github.com/albanD
The goal of this PR is to make one pass over the FSDP constructor and refactor each helper method call to not be `self.<...>`. Subsequent PRs will make further passes over the FSDP constructor.
This PR looks like a lot of lines of code change, but it is only reorganization. Methods are moved to `_init_utils.py` and `_common_utils.py`. This also marks the beginning of moving methods from `_utils.py` to `_common_utils.py` -- they will be coalesced eventually. I am only using `_common_utils.py` as a staging ground to include the methods that have been affected by the refactoring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87921
Approved by: https://github.com/mrshenli
- This PR defines a new `api.py` meant to hold the public API for FSDP (minus `FullyShardedDataParallel` itself). This is needed because several of the `_<...>_utils.py` files rely on the public API, and we cannot import from `torch.distributed.fsdp.fully_sharded_data_parallel` without a circular import. Calling the file `api.py` follows the convention used by `ShardedTensor`.
- This PR cleans up the wording in the `BackwardPrefetch`, `ShardingStrategy`, `MixedPrecision`, and `CPUOffload` docstrings.
- This PR adds the aforementioned classes to `fsdp.rst` to have them rendered in public docs.
- To abide by the public bindings contract (`test_public_bindings.py`), the aforementioned classes are removed from `fully_sharded_data_parallel.py`'s `__all__`. This is technically BC breaking if someone uses `from torch.distributed.fsdp.fully_sharded_data_parallel import *`; however, that does not happen in any of our own external or internal code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87917
Approved by: https://github.com/mrshenli
Fixes a minor perf regression I saw in #85688 and replaces the pattern throughout the code base. `obj == Py_None` is directly equivalent to `is_none()`. Constructing a temporary `py::none()` object needlessly increments and decrements the refcount of `py::none`; this method avoids that and is therefore more efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88051
Approved by: https://github.com/albanD
Summary:
There is a memory leak because `torch.clear_autocast_cache()` clears
the autocast cache from the main thread, but autograd can write to
this cache from a background thread, so whatever autograd writes
will leak.
With some offline discussion we decided that a global cache is a
practical way to deal with this, and the performance impact of the
lock should be negligible.
Test Plan:
I don't have a local repro of the original issue, need to look into how to get
that.
A toy example
(https://gist.github.com/vkuzo/0d6318fe7f7cb1c505e370cd5c1a643b)
does cache clearing as expected on forward and backward pass.
local testing:
```
python test/test_cuda.py -k autocast
python test/test_autocast.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86492
Approved by: https://github.com/ezyang
# Summary
Use the private _scaled_dot_product_attention to support _native_multiheaded_attention. _SDP provides access to fused kernels when certain conditions are met, enabling a speedup for MHA.
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87312
Approved by: https://github.com/cpuhrsch
Add sequence number support for UCC, mostly following format of ProcressGroupNCCL.
Pass new test: `test_all_gather_object_subgroup`
Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup`
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
Summary: Even "nvcc not found" should be commented out in minifier_launcher.py, cause there could be a case that PyTorch/minifier can find cuda path but nvcc is not explicitly included in env variable like PATH.
Differential Revision: D40790023
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87959
Approved by: https://github.com/anijain2305, https://github.com/jianyuh
Not sure, what I was thinking when writing something like:
```
auto foo = std::getenv("BAR");
if (!foo) {
  foo = "baz";
}
```
as `std::getenv` return `char *` (i.e. mutable string), but string literals are immutable. (i.e. `const char *`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87949
Approved by: https://github.com/kit1980
# Motivation
- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards; the latter relies on retracing
- the guard system has quite low overhead, but torchxla's tracing overhead is quite high
The main idea is to leverage guard system in torchdynamo to avoid retracing in torchxla so that
- we can integration torchdynamo with XLA
- we reduce or even completely avoid tracing overhead of torchxla
# Technique details
## XLA baseline
We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as baseline so the model will run with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline will be run on XLA.
## New dynamo backends added
We add two new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.
torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run. We should expect the result to be mostly neutral compared to the XLA baseline.
torchxla_trace_once only traces once, at AOT compile time. Here are the steps:
1. dynamo captures guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers the graph, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it.
# Limitations
We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of graph breaking and stitching the subgraphs together, but maybe it's easier to add the missing LTC kernels for those models.
# Results
The models we tested are those not causing LTC fallback. We ran the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral as expected.
```
| Model | XLA (trace once) | XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18 | 1.346 | 1.045 |
+-------------------------+--------------------+-------------------------+
| resnet50 | 1.153 | 1.007 |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d | 1.381 | 1.039 |
+-------------------------+--------------------+-------------------------+
| alexnet | 1.045 | 1.018 |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2 | 1.562 | 1.021 |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0 | 1.303 | 1.069 |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1 | 1.278 | 1.025 |
+-------------------------+--------------------+-------------------------+
| vgg16 | 1.076 | 1.008 |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch | 2.224 | 0.978 |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer | 1.81 | 1.025 |
+-------------------------+--------------------+-------------------------+
| geomean | 1.38101 | 1.02324 |
+-------------------------+--------------------+-------------------------+
```
The speedup is similar to what we see from previous work for LTC's TorchScript backend (we see 1.40 geomean speedup there):
https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5
# Next steps
- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models
Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```
Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was critical for getting the results above. torchxla side PR: https://github.com/pytorch/xla/pull/4119
topic: not user facing
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87741
Approved by: https://github.com/wconstab
This is exclusively used by macOS, ROCm, and any other future workflows that don't have direct access to S3 to upload their artifacts.
### Testing
Running the script locally with the personal GITHUB_TOKEN:
```
python3 -m tools.stats.upload_artifacts --workflow-run-id 3342375847 --workflow-run-attempt 1 --repo pytorch/pytorch
Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb
Downloading sccache-stats-macos-12-py3-arm64-runattempt1-9155493770
Downloading sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303
Downloading sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-arm64-runattempt1-9155493770 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-arm64-9155493770
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-lite-interpreter-x86-64-9155493303
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-x86-64-9155493627
Downloading test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-m1-12_9155888182.zip
Downloading test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-m1-12_9155888182.zip
Downloading usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-m1-12_9155888182.zip
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87827
Approved by: https://github.com/clee2000
This is a composable activation checkpointing API. Unlike functional
activation checkpointing APIs, this one does not require changing
model source code. Unlike ``nn.Module`` wrapper activation checkpointing
APIs, this one does not modify model structure or fully-qualified names
either. Under the hood, it registers activation checkpointing logic as pre-
and post-forward hooks.
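A minimal usage sketch of what this looks like in user code; the import path is an assumption based on the composable-API naming, and the key point is that the module is modified in place via hooks rather than wrapped:
```
import torch
import torch.nn as nn
from torch.distributed._composable import checkpoint  # assumed import path

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
checkpoint(model[0])  # registers pre-/post-forward hooks; no wrapper module, no FQN changes

loss = model(torch.randn(2, 16)).sum()
loss.backward()  # activations of model[0] are recomputed here
```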
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87664
Approved by: https://github.com/zhaojuanmao
Currently the CUDA UCC barrier is nonblocking with respect to the CPU, and there is no flag to change it. To make UCC PG barrier behaviour consistent with the NCCL PG, this PR changes barrier to always be blocking.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86961
Approved by: https://github.com/kwen2501
Meta tensor conversion does a lot of work to make sure tensors "look" similar
to their originals; e.g., if the original was a non-leaf, the meta
converter ensures the meta tensor is a non-leaf too. Fake tensor
destroyed some of these properties when it wrapped the meta tensor in a FakeTensor.
This patch pushes the FakeTensor constructor into the meta converter
itself, so that we first create a fake tensor, and then we do various
convertibility bits to it to make it look right.
The two tricky bits:
- We need to have no_dispatch enabled when we allocate the initial meta
tensor, or fake tensor gets mad at us for making a meta fake tensor.
This necessitates the double-callback structure of the callback
arguments: the meta construction happens *inside* the function so
it is covered by no_dispatch
- I can't store tensors for the storages anymore, as that will result
in a leak. But we have untyped storage now, so I just store untyped
storages instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87943
Approved by: https://github.com/eellison, https://github.com/albanD
ONNX and PyTorch have different equations for pooling and different strategies for ceil_mode, which leads to discrepancies in corner cases (#71549).
Specifically, PyTorch average pooling does not follow [the equation in its documentation](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html); instead, it allows the sliding window to go off-bound if it starts within the left padding or the input (see the NOTE section). More details can be found in #57178.
This PR changes avgpool in opsets 10 and 11 back to the opset 9 approach, i.e. it stops using ceil_mode and count_include_pad in onnx::AveragePool.
A comprehensive test for all combinations of parameters can be found in the next PR, #87893.
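For reference, a small illustration of the PyTorch behavior at issue (illustrative only; how the partial, out-of-bounds window gets averaged is exactly where PyTorch and onnx::AveragePool can disagree):
```
import torch
import torch.nn as nn

x = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)
# With ceil_mode=True the last window per dimension starts inside the input
# but extends past the right/bottom edge.
pool = nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=True, count_include_pad=True)
print(pool(x).shape)  # torch.Size([1, 1, 3, 3]); the last row/column of windows is partial
```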
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87892
Approved by: https://github.com/BowenBao
The conda dependencies have all been installed for `_mac-test` in https://github.com/pytorch/pytorch/pull/87541. I missed the same step for `_mac-build` and `_mac-test-mps` workflows, so both are also updated here. Note that arm64 is cross-compiled from x86, so the env file needs to be set explicitly in that case
After this one, I have a WIP PR to consolidate macos pip dependencies next
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87816
Approved by: https://github.com/ZainRizvi
This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.
- At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
- At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87916
Approved by: https://github.com/mrshenli
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643
1. Add a decorator function exception_handlers to c10d collectives.
2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.
```
python3 test/distributed/test_c10d_error_logger.py
```
Test Plan: Test in OSS.
Reviewed By: H-Huang
Differential Revision: D40281632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
Summary: The Mac contbuild builds under `fbcode/mode/mac`, which caffe2 fails to build under. This is due to that build mode enforcing protobuf v3. The caffe2 targets already account for this issue under `arvr` build modes by swapping out protobuf dependencies, but they don't account for the same issue under `fbcode/mode/mac`. This diff fixes that by checking for `is_fbcode_mac` in these situations (in addition to `arvr`).
Test Plan:
```
buck build --flagfile fbsource//fbcode/mode/mac fbsource//xplat/caffe2/...
```
Reviewed By: kimishpatel
Differential Revision: D39552724
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87293
Approved by: https://github.com/kimishpatel
This PR fixes a number of bugs found by the Svace static analyzer:
1. DEREF_AFTER_FREE at qnnpack_utils.h:
Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp:
After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL at stmt.h:
After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL at loopnest.h:
Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67.
This is the same error as 3: forwarding a nullptr to malformed_input().
5. TAINTED_INT.LOOP in python_arg_parser:
Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705
Approved by: https://github.com/kit1980
Avoid passing a raw pointer to 'torch::jit::Graph' to Python. Otherwise, it will corrupt pybind11's
`internals::registered_instance`, caching a holder for Python keyed by the raw
pointer of 'torch::jit::Graph' while not increasing the use count of the existing shared_ptr.
The behavior afterwards is random and probably undefined.
Most of the time it works: if the holder is deallocated promptly on the Python side, the
cache entry is then cleared from `internals::registered_instance` and things are back to normal.
Otherwise, it fails with either a segfault or a runtime error with the message "Unable to cast
from non-held to held instance". One such scenario is normally and correctly
returning a shared_ptr of that 'torch::jit::Graph' to Python. Pybind finds the holder via
the cache, so the shared_ptr use_count will not increase. If there is no other use
on the C++ side, the graph will be freed while Python still has access via the holder created
previously.
@t-vi had a great analysis and solution to this exact problem at #51833, which I wish
I had seen before debugging this issue... ~~I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately
due to CLA issues.
Need to check in CI if adding `enable_shared_from_this` breaks other stuff.
Fixes#51833, and CI issues in #87258, #86182.
cc @malfet, @kit1980 for changes on JIT IR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87343
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet
Summary:
Previously we hardcoded the supported observers for fixed qparam ops. This PR changes that to take the information from BackendConfig,
which allows users to customize the support for fixed qparam ops.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_change_backend_config_for_fixed_qparam_ops
Reviewers:
Subscribers:
Tasks:
Tags:
unlinked from diff since it's too hard to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87425
Approved by: https://github.com/andrewor14
This is for consistency with FSDP.
- `_FSDP_WRAPPED_MODULE` and `_CHECKPOINT_WRAPPED_MODULE` are exactly the wrapped module variable name, meaning you can call `getattr(module, _FSDP_WRAPPED_MODULE)` or `getattr(module, _CHECKPOINT_WRAPPED_MODULE)`.
- `_FSDP_PREFIX` and `_CHECKPOINT_PREFIX` include the trailing `"."` and are only used for FQNs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87951
Approved by: https://github.com/zhaojuanmao
We change `.module` to pass through `ActivationWrapper` directly to the inner wrapped module. This should fix the state dict issues.
Given the invariant that `.module` always returns the inner wrapped module, FSDP always registers the `FlatParameter` on the inner wrapped module, regardless of if there is an intermediate `ActivationWrapper` or not. This avoids casing on whether `ActivationWrapper` is added before or after FSDP construction.
This PR removes the added unit test in `test_fsdp_misc.py` for changing the wrapped module because I would rather not complicate `_lazy_init()` logic just to support that kind of adversarial behavior. The user should not be swapping out the wrapped module arbitrarily or deleting the `FlatParameter`. I mainly had those tests to make sure that all branches of the code I added were correct.
Differential Revision: [D40799961](https://our.internmc.facebook.com/intern/diff/D40799961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87950
Approved by: https://github.com/zhaojuanmao
The logic for determining the conv backend and therefore the output striding is very complex. It depends on build settings, input striding/contiguity, sizes, etc. Eventually we should port that logic to the meta impl for dynamic shapes, but that will require a lot more work and keeping the implementations in sync. See https://github.com/pytorch/torchdynamo/issues/1701
This is a prerequisite to removing the inductor conv stride propagation and more general fake tensor for inductor propagation. In that PR, the meta impls for cpu conv give incorrect striding which led to test failures (https://github.com/pytorch/pytorch/pull/87083).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87305
Approved by: https://github.com/ezyang
Workaround for https://github.com/pytorch/torchdynamo/issues/1775; calling sqrt is better in any case, but `libdevice.pow` still doesn't work for some reason if both arguments are scalars
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @mreso, can you please check if that takes you further with diffusers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87912
Approved by: https://github.com/desertfire
# Summary
Add a torch.backends.cuda flag and update the context manager to pick between the three implementations of scaled_dot_product_attention.
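A hedged usage sketch of the kind of control described above; the names used here (`torch.backends.cuda.sdp_kernel` and the public `scaled_dot_product_attention`) are the forms available in later releases rather than necessarily the exact surface added by this PR:
```
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 128, 64)
# Pick which SDP implementations the dispatcher is allowed to use.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=True,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```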
cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87946
Approved by: https://github.com/cpuhrsch
**Description**
Replace the pooling algorithm `pooling_avg` with `pooling_avg_exclude_padding` in the implementation of mkldnn pooling. It's only a change of name, not algorithm: the former is an alias of the latter and will be removed in future oneDNN library upgrades.
This change has no effect on functionality or performance.
**Validation**
Covered by UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87851
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
Fix a mix-up between Caffe2_CPU_INCLUDE and Caffe2_GPU_INCLUDE: when expanding a variable to the parent scope, the same variable name should be used. This fix corrects compilation in certain build configurations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87030
Approved by: https://github.com/kit1980
### Bug description
When `__SYCL_DEVICE_ONLY__` is defined, while building PyTorch, the output of the preprocessing step would not have the closing curly brace of the `extern "C"` block, as it has been incorrectly placed. Compilers don't seem to report an error or a warning for a missing closing brace of an `extern "C"` block.
### Impact of the bug
If `c10/macros/Macros.h` is included in a C++ file and, after the preprocessing stage, the preprocessed source file has some templated code after `extern "C" {`, then the build might fail with the error `templates must have c++ linkage`. eg. https://stackoverflow.com/questions/61717819/template-with-c-linkage-error-when-using-template-keyword-in-main-cpp/61717908#61717908 (its answer also has a small snippet of code to reproduce such an issue).
### Solution in this PR
A one-liner bug fix that rectifies the placement of the closing curly brace (`}`) so that the `extern "C"` block ends properly when `__SYCL_DEVICE_ONLY__` is defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87853
Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet
Parallel compilation warms the thread pool when we call `torch._dynamo.optimize()`. In the current benchmarks, we were setting up TRITON_CACHE_DIR much later. Because of this, parallel-compilation artifacts were not used and compilation latency improvements were not visible in the dashboard. This PR just moves the setup of TRITON_CACHE_DIR earlier.
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87913
Approved by: https://github.com/wconstab
`_recursive_wrap()` returns `Tuple[nn.Module, int]`, where the `nn.Module` is the in-place modified module and the `int` is the numel wrapped. In that sense, the return value is not meant to be publicly used. The `apply_activation_checkpointing()` docs already suggest that the function returns `None`, so this PR simply follows that.
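A small sketch of the resulting contract (the import path and the `check_fn` argument are assumptions for illustration): the call mutates the model in place and its return value is not meant to be used.
```
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,  # assumed import location
)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
ret = apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, nn.Linear))
assert ret is None  # model is modified in place; nothing useful is returned
```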
**Test Plan**
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87871
Approved by: https://github.com/zhaojuanmao
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.
The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.
Test Plan: Existing tests
Differential Revision: D40724547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
Recently, I retired `FlattenParamsWrapper`, which meant that FSDP registers its `FlatParameter` on the wrapped module instead of the `FlattenParamsWrapper` instance. This is only relevant for `use_orig_params=False`.
If the user changes an FSDP instance's wrapped module after the FSDP constructor, then the `FlatParameter` is no longer registered on the wrapped module. This can cause issues for full state dict, which checks if the `FlatParameter` is currently registered as an early return condition for `rank0_only=True`.
The solution in this PR is to re-establish the wrapped module in `_lazy_init()`, de-registering from the old wrapped module and re-registering to the new wrapped module, where the assumption is that the user should not modify the module structure upon `_lazy_init()`.
The direct access to the private attribute `_parameters` from `nn.Module` is not ideal, but we already rely on it for the dynamic `FlatParameter` registration. The tradeoff is whether we want an additional `nn.Module` wrapper (`FlattenParamsWrapper`) and use `delattr` plus a singleton list to do the dynamic registration or we want to access `_parameters`. If this becomes a problem, we can work with Core team on a solution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87837
Approved by: https://github.com/zhaojuanmao
This refactor was prompted by challenges handling mixed int/float
operations in C++. A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/). This PR takes a different
approach.
The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode. This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods. This has a number of
knock on effects.
- We no longer have C++ classes to bind to Python. Instead, we take an
entirely new approach to our Python API, where we have a SymInt/SymFloat
class defined entirely in Python, which hold a SymNode (which corresponds
to the C++ SymNode). However, SymNode is not pybind11-bound; instead,
it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
when it goes into C++. This implies a userland rename.
In principle, it is also possible for the canonical implementation of SymNode
to be written in C++, and then bound to Python with pybind11 (we have
this code, although it is commented out.) However, I did not implement
this as we currently have no C++ implementations of SymNode.
Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
code needs to know how to find these classes. Currently, this is done
just by manually importing torch and getting the attributes.
- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
__torch_dispatch__ works.
Some miscellaneous improvements:
- SymInt now has a constructor that takes SymNode. Note that this
constructor is ambiguous if you pass in a subclass of SymNode,
so an explicit downcast is necessary. This means toSymFloat/toSymInt
are no more. This is a mild optimization as it means rvalue reference
works automatically.
- We uniformly use the caster for c10::SymInt/SymFloat, rather than
going the long way via the SymIntNode/SymFloatNode.
- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
functions, pretty sure this doesn't do anything.
- guard_int is now a free function, since to guard on an int you cannot
assume the method exists. A function can handle both int and SymInt
inputs.
- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
plain methods; this is to help avoid confusion between the two types.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
`.detach()` worked in basic cases previously, but didn't properly preserve view relationships between the base and the output. This wasn't heavily tested, because autograd doesn't normally encounter `FunctionalTensorWrapper` directly, but could become more common if we fuse functionalization and autograd into a single tracing pass.
This will also be a bug fix for LTC (and XLA when they use functionalization)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87750
Approved by: https://github.com/ezyang
Summary:
Python's function parsing from the `ast` module records the line number of the function definition, not the first decorator. So this diff fixes crashes like this:
```
IndexError: vector::_M_range_check: __n (which is 10) >= this->size() (which is 8)
```
Test Plan: New unit test
Differential Revision: D40726352
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87804
Approved by: https://github.com/tugsbayasgalan, https://github.com/davidberard98
A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes, but we don't observe TensorImpl destruction, so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)
Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage. (Which is a `c10::intrusive_ptr<c10::StorageImpl>`) `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references then everything is deleted. However if there is a weak reference then the intrusive_ptr calls `release_resources()` but not delete.
This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.
This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.
Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
Summary: Duplicating the fbcode target `fbcode//caffe2:torch-cpp-cpu` in xplat. In D40460749 our user wants to use the `torch::kNearest` enum which is defined in `torch/csrc/api/src/enum.cpp`. Adding this target to support it.
Test Plan: Rely on CI
Differential Revision: D40532087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87327
Approved by: https://github.com/ezyang
Summary:
att, this is an experimental API so not marking it as BC-breaking.
The match will be accepted only if all the filters in the list pass.
Changing the filter arg to be a list also allows us to pass in an empty list, which means no filter and makes user code cleaner.
Test Plan:
python test/test_fx.py -k test_replace_pattern_with_filters
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87257
Approved by: https://github.com/SherlockNoMad
Summary:
_convert_to_reference_decomposed is a private convert function in the fx graph mode quantization flow that converts
a calibrated/trained model to a reference quantized model with decomposed quantized tensor representations.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87094
Approved by: https://github.com/andrewor14
Simplification of one of the installation instructions in CONTRIBUTING.md that I found tricky to parse at first.
Also adds a link to the "Make no-op build fast" section to make it easier to navigate to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87460
Approved by: https://github.com/ngimel
In this PR:
- graph_task stores graph roots on construction so that we can later traverse through the graph
- before the nodes are returned, they need to be converted from raw_ptr to shared_ptr, and this should be OK because the graph is guaranteed to be alive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87507
Approved by: https://github.com/albanD
The current unit tests were only checking tensors whose shapes were already multiples of the block size. That caused some hidden bugs to creep in. Specifically, for shapes that would require padding for the mask/data, the sparsifier would try to apply shape-mismatched tensors onto each other. This caused segfaults as well as silent failures.
This PR makes minor adjustments to the code to make sure the mask and data shapes are aligned, as well as fixing the tests to catch this.
Test Plan:
```python
python test/test_ao_sparsity.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87326
Approved by: https://github.com/jcaip
I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA
> Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time.
The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293
This undesired behavior can be turned off by setting the advanced option `check-latest` to false https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version. Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package, avoiding the need to query PyPI every single time.
`check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454:
```
>>> Lint for .github/workflows/lint.yml:
Error (ACTIONLINT) [action]
input "check-latest" is not defined in action "actions/setup-python@v4".
available inputs are "architecture", "cache", "cache-dependency-path",
"python-version", "python-version-file", "token"
25 | with:
26 | python-version: 3.8
27 | architecture: x64
>>> 28 | check-latest: false
29 | cache: pip
30 | cache-dependency-path: |
31 | **/.github/requirements-gha-cache.txt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621
Approved by: https://github.com/ZainRizvi
This reverts commit 4080b1db284fd531654bcb2984a7fe0ff3b310cd.
Reverted https://github.com/pytorch/pytorch/pull/87621 on behalf of https://github.com/huydhn due to Somehow setup-python treats Python 3.10 as Python 3.1 in pr-label.yml. I missed this signal because this is only run at push
I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA
> Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time.
The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293
This undesired behavior can be turned off by setting the advanced option `check-latest` to false https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version. Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package, avoiding the need to query PyPI every single time.
`check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454:
```
>>> Lint for .github/workflows/lint.yml:
Error (ACTIONLINT) [action]
input "check-latest" is not defined in action "actions/setup-python@v4".
available inputs are "architecture", "cache", "cache-dependency-path",
"python-version", "python-version-file", "token"
25 | with:
26 | python-version: 3.8
27 | architecture: x64
>>> 28 | check-latest: false
29 | cache: pip
30 | cache-dependency-path: |
31 | **/.github/requirements-gha-cache.txt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621
Approved by: https://github.com/ZainRizvi
Summary: Added QConfigMultiMapping which is essentially a
List[QConfigMapping] with set methods and dedicated handling to
avoid unwanted matches and improve UX.
Note: the `from __future__ import annotations` line caused weird errors when the
QConfigMultiMapping class was put in _numeric_suite_fx.py, so it was moved.
Test Plan: python test/test_quantization.py TestFxNumericSuiteNShadows
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86922
Approved by: https://github.com/vkuzo
Fixes#86744
- Implementing the new `expm1_out_mps` function in `aten/src/ATen/native/mps/operations/UnaryOps.mm`
- Adding it to `aten/src/ATen/native/native_functions.yaml`
- Adding it to existing `test.test_mps.TestNLLLoss.test_unary_ops`
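A quick usage sketch of the op this wires up (it only exercises the new kernel on builds where the MPS backend is available):
```
import torch

if torch.backends.mps.is_available():
    x = torch.linspace(-1.0, 1.0, 5)
    out_mps = torch.expm1(x.to("mps"))   # runs the new MPS kernel
    out_cpu = torch.expm1(x)             # CPU reference: exp(x) - 1
    print(torch.allclose(out_mps.cpu(), out_cpu, atol=1e-6))
```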
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87147
Approved by: https://github.com/kulinseth
[Alban]: the other changes that used to be in this PR (neg and fix for true div) are moved to other places where they already exist. Namely neg is already in master and true div will be in the next PR on the stack where all other functions are fixed at the same time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87294
Approved by: https://github.com/ezyang
This PR allows transposes to be fused with other operations. If a fusion group is formed only from operations that just manipulate metadata in PyTorch (transpose, view, etc.) then this group is not sent to nvFuser.
On top of that, if we have converted to `nvprims` but then decided not to form a fusion group, we modify the graph to use the `prim.impl_aten` attribute instead of calling `prim(*args, **kwargs)`, which has higher overhead.
cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86967
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
Summary: This diff implements copy_ in order to allow pinned memory transfers for nested tensors, as well as fill_ and ones_like, to test whether nested tensors can be created with other factory functions.
Test Plan: Pass all CI and sandcastle jobs.
Reviewed By: mikekgfb
Differential Revision: D40689594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87728
Approved by: https://github.com/cpuhrsch
Summary:
Someone was running into problems where
1) Static Runtime enablement would fail
2) We would try to fall back to the JIT interpreter *after trying to create `StaticModule`*
3) The fallback fails because Static Runtime mangled the graph.
We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.
With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.
Test Plan: New unit test
Differential Revision: D40564452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
If Python was launched with 'spawn', it will not use the standard
shutdown methods that concurrent.futures requires. So we register a
shutdown with the method it does use. Without this, shutdown hangs
since the workers will not exit.
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87725
Approved by: https://github.com/wconstab
The `_cast_` family of symbolic functions has been created from a template function. Even though it saved some lines, it very much obscured the intention of the code. Since the list doesn't really change and the `_cast_` family are IIRC deprecated, it is safe for us to expand the templates and make the code more readable.
This PR also removes any direct calls to `_cast_` functions to maintain a consistent pattern of directly creating `Cast` nodes.
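A hedged sketch of the pattern this standardizes on: a symbolic function emits an onnx::Cast node directly via `g.op` rather than going through a generated `_cast_*` helper. The function below is illustrative (not one added by the PR); `g` is the graph-context object the exporter passes to symbolic functions.
```
# ONNX TensorProto.DataType.FLOAT is 1; using the raw value keeps this sketch
# free of any assumption about where the dtype enum is exported.
def to_float_symbolic(g, self):
    # Create the Cast node directly instead of calling a _cast_Float helper.
    return g.op("Cast", self, to_i=1)
```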
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87666
Approved by: https://github.com/BowenBao
Summary:
Added q/dq implementation for out of core (decomposed) quantized Tensor representation, meaning that
instead of storing quantization parameters (e.g. scale/zero_point) in a separate quantized Tensor object, we will store
quantization parameters in the arguments of the operators.
```
quantize(float32_tensor, scale, zero_point, dtype) -> int8_tensor
dequantize(int8_tensor, scale, zero_point, dtype) -> float32_tensor
```
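A hedged sketch of what that representation computes, written with plain torch ops purely for illustration (these are not the actual ops added here):
```
import torch

def quantize_decomposed(x, scale, zero_point, quant_min=-128, quant_max=127):
    # The quantization parameters travel as arguments, not as tensor metadata.
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)

def dequantize_decomposed(xq, scale, zero_point):
    return (xq.to(torch.float32) - zero_point) * scale

x = torch.randn(4)
xq = quantize_decomposed(x, scale=0.1, zero_point=0)
print(dequantize_decomposed(xq, scale=0.1, zero_point=0))  # approximately x
```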
Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize
python test/test_quantization.py TestQuantizedTensor.test_decomposed_dequantize
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87093
Approved by: https://github.com/dzdang, https://github.com/z-a-f
Fixes some failures we observed in `functorch` tests which seemed to stem from benchmark cache collisions on the same memory format. Changing the memory format to be dependent on both input and weight seems to resolve them.
CC @crcrpar @ptrblck
cc @csarofeen @ptrblck @xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87617
Approved by: https://github.com/ngimel
Summary:
When building the 3d photo sdk generator package in arvr/mode/mac and arvr/mode/mac-arm modes, we hit several issues with the aten cpu and xnnpack libraries.
The reason is that those packages are using platform-* properties (platform-deps, platform-srcs...) which are not compatible with arvr modes.
This diff fixes those issues by using `select` for non-platform properties when is_arvr_mode() is true, while keeping those platform ones for non-arvr modes.
Test Plan:
```
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/opt
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/opt
```
and sandcastle builds
Differential Revision: D40028669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87125
Approved by: https://github.com/kimishpatel
Beginning of building the xnnpack graph from the torchscript IR. We first massage the torchscript graph using a few graph passes that perform things such as unused self argument removal and constant propagation.
This also performs tracing for us so that the model does not have to be prepped by tracing before being lowered by us.
The other check we perform goes through the torchscript IR to identify any nodes that are not lowerable/supported, throwing an error that spits out the specific nodes that are not lowerable.
Differential Revision: [D39838338](https://our.internmc.facebook.com/intern/diff/D39838338/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39838338/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87128
Approved by: https://github.com/salilsdesai
**Introduces symbolic shape guards into dynamo.**
In this PR, we take the existing fake tensor infra and plumbing in dynamo and we start passing a shape_env around. This shape_env does not get plumbed down to middle layers / backend yet - it only collects expressions from frontend invocations at the moment. We then translate these expressions into guards at the point where we take other guards installed throughout dynamo - and add them to check_fn.
Part 1 of https://docs.google.com/document/d/1QJ-M4zfMkD-fjHIqW089RptjLl9EgozZGCceUbvmgfY/edit#
cc @jansel @lezcano @fdrocha @mlazos @soumith @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87570
Approved by: https://github.com/ezyang
We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation. But, after git bisect, I found the source of extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049
For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs to be channels-last. I am not sure why it caused such a large jump in compilation time, or why it did not fail; there was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to help identify what could be the source of this compilation time increase, so that we can manually check that part of the stack.
With this `res2next50` compilation time went back to 96 seconds (which was raised to 900 seconds with my earlier PR) for single thread. And parallel-compilation brings it down to ~30 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel
According to #38248, quantized::conv1d_relu shares packing parameters with Conv2D (kspatialDim is also 2), and needs a different way of unpacking. Therefore, a new `QuantizedParamsType=Conv1D` is used to differentiate the two, and 1D information has to be extracted from the 2D packed parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85997
Approved by: https://github.com/BowenBao
Why we want to graph-break FSDP
- FSDP has communication ops during forward and backward which we currently can't trace into the graph but also want to ensure are overlapped with compute
- dynamo has issues tracing into or capturing a call to fsdp module without a break (see below)
How we graph-break on FSDP
- marking FSDP.forward code as skip means the code frames will graph-break; but in this case all of torch.* is listed in skipfiles.py anyway, so this is taken care of
- disallowing the FSDP module prevents dynamo from trying to record a 'call_module(FSDPmodule)' node into a graph, which would happen earlier than the graph break caused by skip and causes additional issues: dynamo deepcopies modules before call-module handling, and the FSDP module isn't trivially deep-copyable
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87420
Approved by: https://github.com/aazzolini
This is a policy update for meta registration. **We now prefer python meta implementation over C++ meta function.** This is a flip of the previous policy, where we prefer C++ meta function over python meta function if they both exist.
Here's the meta registration process:
1. register_meta and register_decomposition will place the python meta/decomp functions into the `global_decomp_table`. However, they will NOT register them into dispatcher.
2. After global_decomp_table is populated, we will compile an `active_meta_table`. For a given op, we pick the most specific decomp function from `global_decomp_table` in the preference order of Meta > PostAutograd > PreAutograd.
3. We will unconditionally register all of them into the python dispatcher, and register them into the C++ dispatcher unless it is one of the following 3 cases
- 1. the op is a CompositeImplicitAutograd, and should rely on decomposed op's meta
- 2. the op is a view op, as the MetaTensor doesn't support aliased storage
- 3. the op is in the blocklist (due to UT failures, and we will burn down this list op by op)
Over the long run, we wish to implement all meta functions in python. With this PR, 321 op_overloads will have their cpp meta overridden by a python meta. There are still 400 op_overloads using the cpp meta. The exact list can be found here https://gist.github.com/SherlockNoMad/d20bb736178df8eebd3b054c8bb7cdc5
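For context, a hedged sketch of the shape-only contract a Python meta function follows (illustrative; the real registrations use the register_meta/register_decomposition machinery described above):
```
import torch

def meta_mm(a, b):
    # A meta function computes only output metadata (shape/dtype/device);
    # no real data is touched.
    assert a.shape[1] == b.shape[0]
    return a.new_empty((a.shape[0], b.shape[1]))

out = meta_mm(torch.empty(3, 4, device="meta"), torch.empty(4, 5, device="meta"))
print(out.shape, out.device)  # torch.Size([3, 5]) meta
```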
cc @ngimel @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87426
Approved by: https://github.com/ezyang, https://github.com/jansel
I am not sure if this will break things ...
Although 0d tensors are undefined behavior in the ONNX spec, I did some experiments and found that ONNX shape inference actually produces 0d results from 0d and 1d op calculations, and the bug happened in the Broadcast function. Still, if this breaks things badly, I think we can put 0d tensor handling on hold, as it's not a very common usage in models?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87211
Approved by: https://github.com/jcwchen, https://github.com/BowenBao
Threads within a thread block are synchronized inside the function BlockReduceSum when the intra-warp reduction finishes. It's unnecessary to synchronize threads before invoking BlockReduceSum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84854
Approved by: https://github.com/ngimel
Fixes#87359, which identifies a use-after-free for reverse device maps. This is only in the dynamic RPC feature and does not affect the stable RPC code path.
Unfortunately the test `TensorPipeRpcTest.test_dynamic_rpc_existing_rank_can_communicate_with_new_rank_cuda` that is failing is also running into a separate issue. I've temporarily disabled some of the test code to investigate the error asynchronously.
Testing plan:
- tested all the dynamic RPC tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87627
Approved by: https://github.com/rohan-varma
Sometimes you want to query the smallest element of a collection and use `sorted(elements)[0]` without a second thought. However, this is not optimal, since the entire list must be sorted first (`O(n log n)`). It would be better to use `min(elements)`, which is provided for this purpose (`O(n)`).
Furthermore `sorted(elements)[::-1]` is not very efficient, because it would be better to use `sorted(elements, reverse=True)` to save the slice operation.
**TLDR: using `sorted(elements)[0]` is slow and can be replaced with `min(elements)`.**
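A tiny example of the two patterns:
```
elements = [5, 3, 9, 1, 7]

smallest = sorted(elements)[0]   # O(n log n): sorts the whole list first
smallest = min(elements)         # O(n): single pass, same result

descending = sorted(elements)[::-1]          # sorts ascending, then copies via a slice
descending = sorted(elements, reverse=True)  # sorts once, no extra slice
```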
I stumbled across these code snippets while playing around with CodeQL (see https://lgtm.com/query/4148064474379348546/).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86995
Approved by: https://github.com/jansel
So far, we only cache the macOS conda dependencies for the build workflow. All the test dependencies are still not cached and are installed by the CI. This PR introduces a new `.github/requirements` directory in which I plan to explicitly include all the conda and pip build and test dependencies across all platforms. This allows pip and conda installation to be consolidated in one place (and properly cached).
Those conda dependencies come from https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/macos-common.sh. Once this PR is merged, I will follow up with another one to clean up all conda installation from that file (to make sure that nothing breaks along the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87541
Approved by: https://github.com/ZainRizvi
This is a temporary fix for an internal SEV. We have run three different workflows to validate that this fix would unblock the internal SEV.
Here are a few follow-up tasks:
- [ ] Create a reproducible test for multithreading with a generator
- [ ] Figure out how to make fullsynciterator work properly with a generator
- [ ] Move Wrapper back to generator if needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87459
Approved by: https://github.com/NivekT
Summary: Rather than using the full name Profiler Event Index, use the shortened name Ev Idx. In the future, we should address this by adding a lookup table from short names to long names.
Test Plan: CI
Reviewed By: robieta, slgong-fb
Differential Revision: D40328758
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87477
Approved by: https://github.com/chaekit
I always need to install these 2 tools whenever I use Docker manually to debug build and test issues:
* unzip is to extract the zipped artifacts from PyTorch CI
* gdb is to do you know what :)
IMO, it makes sense to have them as part of the container image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86993
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/torchdynamo/issues/1599
Inductor performs aggressive fusion of ops during the lowering of Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with.
In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.
Note that this could affect performance. I doubt that it will lead to a really large dip though. In my toy examples, even if the lowering creates multiple IR nodes, the later fusion still creates one node when it is a simple fusion.
I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.
@ngimel @jansel @Chillee
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
currently failing with
```
To https://github.com/pytorch/cppdocs
+ 2825b2745bb...80ec4daa657 HEAD -> pytorchbot/temp-branch-cpp (forced update)
Branch 'master' set up to track remote branch 'pytorchbot/temp-branch-cpp' from 'origin'.
++ sleep 30
++ git push -u origin
fatal: The upstream branch of your current branch does not match
the name of your current branch. To push to the upstream branch
on the remote, use
git push origin HEAD:pytorchbot/temp-branch-cpp
To push to the branch of the same name on the remote, use
git push origin HEAD
```
Just checked the settings: master of pytorch/cppdocs does not have Easy CLA as a required check, so we don't need the temp branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87614
Approved by: https://github.com/huydhn
The list is for people who want to be notified of changes to the files in there. Review is not required from the people listed; I just want to be notified to keep track of what is going on.
Let me know if you want your name added in this PR too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86213
Approved by: https://github.com/Chillee
Introduced RECORD_OUTPUTS() macro that goes with RECORD_FUNCTION(). It is used to capture the output tensors from a kernel launch. The tensors automatically get passed to the profiler using record_function methods. This allows the profiler to track the tensors that flow into and out of each op.
Fixes#85575
cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86514
Approved by: https://github.com/robieta
Without this change, the post-backward hooks do not run when using reentrant activation checkpointing.
**Explanation**
FSDP registers the original parameters as plain `Tensor`s in the forward pass so that their ops are tracked by autograd to ensure proper gradient propagation into the `FlatParameter`s. FSDP registers the post-backward hooks in its pre-forward.
For `use_orig_params=True`, FSDP replaces the plain `Tensor`s with the sharded `nn.Parameter`s in the post-forward when resharding. This differs from `use_orig_params=False`, which keeps the plain `Tensor`s registered as attributes, except their data are freed, meaning that accessing them between forward and backward errors. Before this PR, for `use_orig_params=True`, FSDP simply restores the unsharded original parameter data in the pre-backward to enable correct gradient computation. However, this does not suffice for reentrant activation checkpointing (AC), where the recomputed forward happens after FSDP's pre-backward and the ops in the recomputed forward must be tracked by autograd.
My initial solution was to simply have FSDP restore the original parameters as plain `Tensor`s again in the pre-backward so that they would be tracked by autograd exactly like the normal forward. However, this seems to not suffice in general. The `FlatParameter`'s `AccumulateGrad` object may change after the original pre-forward when performing a recomputed forward.
The new approach in this PR is to follow the `use_orig_params=False` way -- namely, to preserve the plain `Tensor` variables across forward and backward. I achieved this by saving the variables explicitly in the forward and restoring them in the pre-backward. I clear them in the post-backward to avoid the dangling references (though, I do not think this is strictly necessary).
An alternative approach I considered is using forward hooks. However, this does not change the order of operations across FSDP, checkpoint, and the wrapped module, so it does not work. (As long as the order is FSDP(checkpoint(module)), then registered hooks still happen either before or after the checkpoint recomputation -- we cannot insert logic to run inside the checkpoint recomputation.)
**Test Plan**
I augmented the existing reentrant checkpointing unit tests to also test `use_orig_params=True`. I also verified that the pycls model does not error (even with the new approach).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87413
Approved by: https://github.com/rohan-varma
Util for convenient local benchmarking/debugging of distributed models. Not to be confused with the 'real' distributed benchmark script we use for torchbench experiments on slurm. Tries to be simple/hackable and let you use different combinations of DDP/FSDP with models and dynamo backends.
Example usage
`python benchmarks/dynamo/distributed.py --toy_model --dynamo inductor --ddp`
`--dynamo` flag accepts normal dynamo backends (plus 'print' which literally prints graphs to screen)
`--torchbench_model <model_name>` works in place of `--toy_model`
`--fsdp` is WIP
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87419
Approved by: https://github.com/jansel
Fixes#84053
As described in the issue, the AveragedModel will deep copy the model during initialization, which means that the buffers in the averaged model cannot be updated together with the model.
One solution is to make the buffers equal to those of the source model every time `update_parameters` is called.
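A minimal sketch of the intended behavior, assuming a source model with buffers (e.g. BatchNorm running stats); the module shapes here are illustrative only:
```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
swa_model = AveragedModel(model)

model(torch.randn(8, 4))            # forward pass updates the BatchNorm running stats
swa_model.update_parameters(model)  # with this change, buffers are refreshed from the source model too
```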
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054
Approved by: https://github.com/samdow
This should help with memory usage. In particular, this allows FSDP to use caching allocator blocks from the computation stream for the `summon_full_params()` all-gathers, which should help avoid over-allocating blocks to the unshard stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86836
Approved by: https://github.com/rohan-varma
Summary: This adds a README for `torch.ao.quantization.backend_config`
that describes both the high level motivation and the specifications
of the BackendConfig API.
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86523
Approved by: https://github.com/jerryzh168
This PR removes the property `params_with_grad` from `FullyShardedDataParallel`. It was introduced when implementing `clip_grad_norm_()` but was not consistently used. Personally, I do not think it makes sense for `FullyShardedDataParallel` to expose this helper because it is not a common paradigm.
This PR is technically BC-breaking. However, I checked that no one internally is using this API.
cc @ezyang @gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87480
Approved by: https://github.com/rohan-varma
This PR reworks FSDP's `clip_grad_norm_()` and its unit tests. The unit tests in `test_fsdp_core.py` still need to be revisited and will be done in follow-up work.
Some details in arbitrary order:
- This renames `_calc_grad_norm()` to `_get_grad_norm()`. This is to simplify our verb usage in method names. Otherwise, we may diverge to different verbs like "compute", "calculate", "get", "find" etc. I am open to discussion here.
- Because we call `torch.linalg.vector_norm()` as the underlying norm calculation subroutine, which can take infinity as input for the norm type, there is no reason to have a separate conditional branch for the infinity norm (a short illustration follows this list).
- This removes a host-device synchronization point from `clip_grad_norm_()` by using the same trick from `torch.nn.utils.clip_grad_norm_()`. This may improve throughput for workloads like metaseq, which computes gradient norms regularly.
- This returns the total norm from `clip_grad_norm_()` as mentioned in the docstring. Previously, nothing was returned.
- This rewrites the unit tests, which were slightly problematic. Much of the logic to verify that gradient norms were computed correctly was exactly the same as the logic used to compute them in FSDP (i.e. `^p`, sum via all-reduce, `^(1/p)`), which defeats the purpose of unit testing. There were some other oddities like `input = torch.rand(14, 2, device=self.rank); in_data = torch.tensor(input[self.rank], device=self.rank)`, where we materialize a full `(14, 2)` shape but only ever use the first two rows (assuming world size 2).
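As a short illustration of the `torch.linalg.vector_norm()` point above, the same call handles both finite and infinity norm types (values here are illustrative):
```python
import torch

grads = torch.tensor([3.0, -4.0, 12.0])

print(torch.linalg.vector_norm(grads, ord=2.0))           # tensor(13.)
print(torch.linalg.vector_norm(grads, ord=float("inf")))  # tensor(12.)
```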
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87479
Approved by: https://github.com/rohan-varma
This time around, I decided to rename the "all_gather" stream to the "unshard" stream to emphasize that it includes both the actual all-gather op but also the corresponding memory allocations (and also now the unflattening as well). (A similar reasoning applies for the "pre-all-gather" stream becoming the "pre-unshard" stream.)
This PR is definitely safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86833
Approved by: https://github.com/rohan-varma
`diag` was unnecessarily implemented as a kernel rather than as a composite function, which made it needlessly difficult to maintain (explicit backward + all that it entails).
We also change a few uses of `diag` on 2D tensors to `diagonal()`. The latter returns a view rather than creating a new tensor.
We also upgrade its meta implementation to a fully-fledged
decomposition
I tried implementing the backwards of `diagonal()` via `diag_scatter` (or better `diag_scatter_` to keep the perf) but functionalisation was failing and I was not sure how to fix this, so I moved on. It may be possible to simplify that one as well if @soulitzer or someone knows how to do this.
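A small illustration of the `diag` vs `diagonal()` point above: `torch.diagonal` returns a view, while `torch.diag` on a 2D tensor allocates a new tensor.
```python
import torch

a = torch.arange(9.0).reshape(3, 3)
d_view = torch.diagonal(a)  # view into `a`
d_copy = torch.diag(a)      # freshly allocated 1D tensor

d_view[0] = 100.0
print(a[0, 0])    # tensor(100.) -- writing through the view modifies `a`
print(d_copy[0])  # tensor(0.)   -- the copy is unaffected
```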
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87180
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/mruberry
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87549
Approved by: https://github.com/soumith
Summary:
1) Adding an MKL/AVX2-based implementation into perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc
2) Migrating batch_box_cox_op of caffe2 to use this implementation
Test Plan: CI
Differential Revision: D40208074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569
Approved by: https://github.com/hyuen
While optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect using that structure. Note that this doesn't lock us into anything; we can always collect state with non Tensor keys if the use case arises.
One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state.
Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
We already link cuDNN and cuBLAS dynamically for all configs (the statically linked cuDNN is a different library than the dynamically linked one, increases the default memory footprint, etc.), and libtorch_cuda, even when compiled for all GPU architectures, is no longer approaching the 2Gb binary size limit, so BUILD_SPLIT_CUDA can go away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502
Approved by: https://github.com/atalman
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87525
Approved by: https://github.com/davidberard98
- this `--cold_start` experiment didn't end up being used
- there is a new `--cold_start_latency` flag that is used
- this experiment was only hooked up for nvfuser anyway
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87470
Approved by: https://github.com/anijain2305
Enable a test that would have caught https://github.com/pytorch/pytorch/issues/86239
Prior to the fix for that bug, this test fails with
```
_____________________________ TestCommonMPS.test_numpy_ref_mps_where_mps_float32 _____________________________
Traceback (most recent call last):
File "/Users/alex/git/pytorch/test/test_ops.py", line 197, in test_numpy_ref_mps
self.compare_with_reference(
File "/Users/alex/git/pytorch/torch/testing/_internal/common_utils.py", line 2366, in compare_with_reference
actual = torch_fn(t_inp, *t_args, **t_kwargs)
File "/Users/alex/git/pytorch/torch/testing/_internal/opinfo/core.py", line 1068, in __call__
return self.op(*args, **kwargs)
File "/Users/alex/git/pytorch/torch/testing/_internal/common_methods_invocations.py", line 15167, in <lambda>
op=lambda self, condition, other: torch.where(condition, self, other),
RuntimeError: 0'th index 3 of x tensor does not match the other tensors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87342
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86699
This diff does the following:
1. **c10d_error_logger.py**: Add an API to create a logger with a specific logging handler based on the destination.
2. The API from above would get a logging handler based on the destination provided.
- **caffe2/torch/distributed/logging_handlers.py**: For OSS, we simply use a NullHandler() for now.
3. Add associated test files for 1 and 2.
Test Plan:
## Unit Test
```
buck test @//mode/dev-nosan //caffe2/test/distributed:test_c10d_error_logger -- --print-passing-details
```
```
File changed: fbcode//caffe2/test/distributed/test_c10d_error_logger.py
File changed: fbsource//xplat/caffe2/test/distributed/TARGETS
9 additional file changes
waiting for all tests to finish...
✓ Listing success: caffe2/test/distributed:test_c10d_error_logger (0.2s)
Found 1 tests
✓ Pass: caffe2/test/distributed:test_c10d_error_logger - test_get_or_create_logger (caffe2.test.distributed.test_c10d_error_logger.C10dErrorLoggerTest) (0.2s)
stdout:
stderr:
Buck UI: https://www.internalfb.com/buck2/b975f6b0-77e9-4287-8722-f95b48036181
Test Session: https://www.internalfb.com/intern/testinfra/testrun/1407375150206593
RE: reSessionID-4d7ab8ca-1051-48e9-a5a8-6edbe15d1fe4 Up: 124 B Down: 0 B
Jobs completed: 5. Time elapsed: 3.5s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Differential Revision: D39920391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87123
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
When a commit is triggered via any mechanism other than a pull request, there will not be a PR to check labels for.
The job will fail with the error:
```
2022-10-21T17:50:53.2938592Z + python3 .github/scripts/check_labels.py ''
2022-10-21T17:50:53.4758863Z usage: Check PR labels [-h] pr_num
2022-10-21T17:50:53.4759337Z Check PR labels: error: argument pr_num: invalid int value: ''
```
Instead, we should limit the workflow to only run on pull requests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87488
Approved by: https://github.com/huydhn
dynamo tests call a helper function in torch/_dynamo/test_case.py, which then calls run_tests in common_utils.py, so the test report path looked something like /opt/conda/lib/python3.10/site-packages/torch/_dynamo/test_case
* instead of using the frame, use argv[0], which should be the invoking file
* got rid of the functorch test name sanitization because those tests have been moved into the test folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87378
Approved by: https://github.com/huydhn
This fixes an issue with mobile: The output of view_copy ops should always be contiguous.
Later, we can consider adding optional arguments to the `view_copy()` functions to let you explicitly say what the contiguity of the output can be (e.g. channels_last)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85747
Approved by: https://github.com/ezyang
### Context
When a dev submits a PR against the repo, we want to validate that they applied two labels to the PR corresponding to the module they edited and the kind of change they're making.
### Change
Extended the open source workflow CI to add a validation to ensure that the PR being checked has the required labels on it. If it doesn't, the check fails and a bot will post a message on the PR with instructions on what labels the developer needs to add (https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work).
### Impact
Every time a new version of PyTorch is released, we want to compile all the changes made to each module. However, when devs forget to tag their PR, compiling the changes to write the release notes becomes a burdensome process (only ~20% of PRs are currently labeled appropriately, which means it can take up to 40 hours to compile release notes). With this new validation, the hope is that most PRs are labeled accordingly for more timely release notes compilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86829
Approved by: https://github.com/ZainRizvi
All of the kernels already either start by zeroing the output, or are
careful in their implementation to write values to every output
location. So, these `zero_` calls should be redundant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87375
Approved by: https://github.com/albanD
To cooperate with other multithreading methods, this forces the process pool to use 'fork' even if others have set it differently. We require fork because otherwise `if __name__ == __main__` needs to be set, which we do not control as a library.
Furthermore, this adds code to clean up worker processes if the parent exits abnormally (e.g. a segfault). Previously we would leave live but inactive workers around.
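A minimal sketch of forcing the 'fork' start method for a worker pool regardless of any globally configured method (an assumed shape of the approach, not the actual torch._inductor code; fork is POSIX-only):
```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def compile_job(name):
    # stand-in for the raw compilation work done in a worker process
    return f"compiled {name}"

fork_ctx = multiprocessing.get_context("fork")  # ignore any globally set start method
pool = ProcessPoolExecutor(max_workers=4, mp_context=fork_ctx)
print(pool.submit(compile_job, "kernel_a").result())  # no __main__ guard needed with fork
pool.shutdown()
```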
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411
Approved by: https://github.com/soumith, https://github.com/anijain2305
I noticed that a lot of bugs are being suppressed by torchdynamo's default
error suppression, and worse yet, there's no way to unsuppress them. After
discussion with voz and soumith, we decided that we will unify error suppression
into a single option (suppress_errors) and default suppression to False.
If your model used to work and no longer works, try TORCHDYNAMO_SUPPRESS_ERRORS=1
to bring back the old suppression behavior.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87440
Approved by: https://github.com/voznesenskym, https://github.com/albanD
Some original parameters corresponding to one `FlatParameter` may have `None` gradient while others do not. In that case, the `flat_param.grad` must be non-`None`. However, FSDP should take care to expose the original parameters' gradients regardless. To achieve this, we track a `_is_grad_none` mask over the parameters' gradients.
- `_is_grad_none` is initialized to `False` for all.
- `_is_grad_none[i]` is set to `True` when writing zeros in place of `None` when writing back the `i`th gradient.
- `_is_grad_none[i]` is set to `False` via `_reset_is_grad_none()`, which should be called in the post-backward. See the docstring for details.
- `_is_grad_none[i]` must be `False` in order to set `param.grad` to be a view into `flat_param.grad`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87308
Approved by: https://github.com/zhaojuanmao
This PR changes `summon_full_params(with_grads=True)`'s behavior to be such that if all ranks have `flat_param.grad = None`, then the original parameters will correctly have `orig_param.grad = None`. This is achieved with a preliminary all-reduce. Note that if a particular original parameter's gradient is `None` on all of the containing ranks, but not all ranks' `flat_param.grad = None`, then that particular gradient is still going to be set to zeros. This can be handled if desired in follow-up work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87314
Approved by: https://github.com/zhaojuanmao
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86259
Add assertion to make sure backend is one of "fbgemm", "x86", "qnnpack" and "onednn"
for get_default_qconfig, get_default_qat_qconfig, get_default_qconfig_mapping and get_default_qat_qconfig_mapping
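A brief usage sketch of the guarded entry points (backend strings taken from the list above):
```python
from torch.ao.quantization import get_default_qconfig, get_default_qconfig_mapping

qconfig = get_default_qconfig("x86")               # ok: one of fbgemm/x86/qnnpack/onednn
qmapping = get_default_qconfig_mapping("qnnpack")  # ok
# get_default_qconfig("not-a-backend")             # now rejected by the new assertion
```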
Test Plan:
python test/test_quantization.py -k test_get_default_qconfig_mapping
Imported from OSS
Reviewed By: jcaip
Differential Revision: D40236474
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87331
Approved by: https://github.com/andrewor14
This PR resolves a TODO left in `FlatParamHandle` that was conditional on deprecating `FlattenParamsWrapper`. We simply pass in the process group into the `FlatParamHandle` constructor instead of later in `shard()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87113
Approved by: https://github.com/zhaojuanmao
Testing coverage is pretty much preserved except that we do not test on CPU, which is not a tangible loss for FSDP anyway.
I renamed a few tests slightly, and I moved some helpers to be immediately below the corresponding test method. This makes it a bit easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87112
Approved by: https://github.com/zhaojuanmao
This PR registers each `FlatParameter` to the wrapped module, eliminating `FlattenParamsWrapper` usage completely from FSDP.
Registering each `FlatParameter` to the wrapped module is preferred over registering to the `FullyShardedDataParallel` instance for both functional-like and non-recursive wrapping. It simplifies the `FlatParameter` naming to be a function of the number of `FlatParameter`s per wrapped module instead of the number of `FlatParameter`s per FSDP instance. For now, we assume 1 `FlatParameter` per wrapped module, so we can simply use a single name `FLAT_PARAM = _flat_param`.
From an implementation perspective, we raise some methods from `FlattenParamsWrapper` directly up to `FullyShardedDataParallel`. There will need to be further refactoring for functional-like and non-recursive wrapping. For example, the property `self._has_params -> bool` may need to change to a method `self._has_params(wrapped_module) -> bool`. Such changes are out of scope for this PR and will be done in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87086
Approved by: https://github.com/zhaojuanmao
This removes **direct** usages of `_fsdp_wrapped_module.flat_param` with `_handles[0].flat_param`. The preferred way to access the `flat_param` will be through the handle. We may converge to only storing `self._handles` and no longer `self.params` in the future. Right now, `self.params` is always exactly `[handle.flat_param for handle in self._handles]`.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86122
Approved by: https://github.com/zhaojuanmao
We recently fixed a bug on symbolic-shapes branch where
an isinstance(x, int) test failed when passed a SymIntNode.
To prevent this, I've added a lint for all the codepaths
where we may pass SymInt/SymFloat directly to reject
direct isinstance int/float tests, and instead use one of
the aliases. The lint rule explains the options. I then
go and fix all of them.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87345
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Tensor's view in linear storage is represented by the following parameters: `.shape`, `.stride()` and `.storage_offset()`.
Only tensors that are representable as 1d-views can be copied from host to device (and vice versa) using single [`copy(from:sourceOffset:to:destinationOffset:size:)`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc) call.
Modify `copy_to_mps_` function to do the following steps:
- Cast `src` tensor to dst data type if needed
- Expand `src` tensor to `dst` tensor shape
- Clone `src` tensor if it is not stride contiguous (i.e. cannot be represented by `src.view(src.numel())`)
- Create an empty tensor if `dst` is not stride-contiguous or if its strides are different from the (potentially cloned) `src` strides
- Do a 1d copy from `src` to the (potentially temporary) `dst`
- Finally, do the re-striding/copy on MPS if needed
Add a test to cover the cases where a stride-contiguous permuted tensor is copied to MPS, a non-stride-contiguous tensor is copied to MPS, and a permuted CPU tensor is copied to a differently permuted MPS tensor.
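A rough illustration of the kinds of copies the test exercises (assuming an MPS-capable machine; exact shapes are illustrative):
```python
import torch

src = torch.randn(3, 4).permute(1, 0)  # permuted (non-contiguous) CPU tensor
dst = src.to("mps")                    # CPU -> MPS copy of a permuted tensor

mps_dst = torch.empty(3, 4, device="mps").permute(1, 0)
mps_dst.copy_(torch.randn(4, 3))       # copy into a differently-strided MPS destination
```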
Fixes https://github.com/pytorch/pytorch/issues/86954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86956
Approved by: https://github.com/kulinseth
There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic:
```
value = hook(self, name, value) or value
```
Raises an exception
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
Fixing the logic so that it only checks to see if the value is `None` before overriding
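A minimal, self-contained sketch of the corrected check (the helper name here is hypothetical, not the nn.Module source):
```python
import torch

def apply_hook(hook, module, name, value):
    out = hook(module, name, value)
    return value if out is None else out  # explicit None check; `or` would call bool() on a Tensor

double_hook = lambda module, name, value: value * 2
v = apply_hook(double_hook, torch.nn.Linear(1, 1), "weight", torch.ones(3))
print(v)  # tensor([2., 2., 2.])
```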
Fixes#85837
CC: @albanD @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369
Approved by: https://github.com/albanD
This addresses the security issue in the default Python `unpickler` that allows arbitrary code execution while unpickling.
Restrict the classes allowed to be unpickled to `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict`, as well as `torch.Size`, `torch.nn.Param`, and the `torch.Tensor` and `torch.Storage` variants.
The default for `weights_only` is `False`, but a global override to safe-only loading is available via the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable.
To some extent, addresses https://github.com/pytorch/pytorch/issues/52596
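A hedged usage sketch of the restricted loading mode (the file path is illustrative):
```python
import torch

torch.save({"w": torch.randn(2, 2)}, "ckpt.pt")
state = torch.load("ckpt.pt", weights_only=True)  # opts into the restricted unpickler
```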
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86812
Approved by: https://github.com/ezyang
Summary:
Reland after fixing the Windows build failure for OVR.
Notable change:
```
#if defined(FBCODE_CAFFE2) or defined(FB_XPLAT_BUILD)
```
changed to
```#if defined(FBCODE_CAFFE2) || defined(FB_XPLAT_BUILD)
```
Apparently `-DFB_XPLAT_BUILD` wasn't getting picked up on Windows when using `or` to connect the conditions.
Original commit changeset: 7a31fc4b455f
Original Phabricator Diff: D40198461
Test Plan: waitforsandcastle
Reviewed By: davidberard98, cccclai
Differential Revision: D40290932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87124
Approved by: https://github.com/gmagogsfm
This reverts commit 703c19008df4700b6a522b0ae5c4b6d5ffc0906f.
Reverted https://github.com/pytorch/pytorch/pull/87311 on behalf of https://github.com/anijain2305 due to Bin (desertfire) is trying to get torchbench models in CI, and this PR prevents that. I will bring this back after models are in CI.
Previously we claimed that "forward-mode AD coverage is not that good".
We've since improved it so I clarified the statement in our docs and
downgraded the warning to a note.
Test Plan:
- view docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87383
Approved by: https://github.com/samdow
The bug was discovered in https://github.com/pytorch/pytorch/pull/86842.
torch.cat has an edge case where it ignores all tensors of shape [0]. So
if any of the BatchedTensors have logical shape [0] but physical shape
[B, 0], then we coerce them to shape [0] by slicing them.
Why don't we just ignore those Tensors? We need to propagate
requires_grad-ness somehow (e.g. if the BatchedTensor wraps a Tensor of
shape [B, 0] that requires grad, then the output must require grad).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86932
Approved by: https://github.com/Chillee
It seems like when popen.communicate() is used, it waits for all the descendants of popen to close the stdin/stderr. However, if we have worker processes running in the child, and the child segfaults, those processes will stay alive until someone waitpid's the child.
Since those children have open handles to the stdin/stderr pipe,
communicate never returns.
This change just writes the output to temp files and directly calls
wait() on the child, which returns as soon as it dies.
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87335
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation requires implementing empty_like(), which is based on PR#83140.
Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.
Differential Revision: D40437585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
Summary: Today, in order to get XNNPACK quantized ops to work,
the user must write some code that refers to private data
structures (`_FIXED_QPARAMS_OP_TO_OBSERVER`) to create a
QConfigMapping that is compatible with the symmetric constraints
in the QNNPACK BackendConfig. This is because
`get_default_qconfig("qnnpack")` produces a QConfig that does
not satisfy these constraints, and the default QConfigMapping
for QNNPACK uses this Qconfig.
Instead, we simply put this code into a helper function to make
it easier for the user to run XNNPACK quantized ops. In the
future, once there is feature parity between the set of ops
supported by QNNPACK and XNNPACK, we should revisit whether
to simply change `get_default_qconfig("qnnpack")` to return
an XNNPACK-compatible QConfig.
Test Plan:
python test/test_quantization.py
TestQuantizeFx.test_symmetric_qnnpack_qconfig_mapping
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87002
Approved by: https://github.com/vkuzo
This is part of the effort to consolidate pip and conda installation in the CI to improve our CI reliability. This moves conda cmake installation to Docker in those use cases that require it:
* Ubuntu bionic and focal
On the other hand:
* XLA doesn't seem to need conda cmake anymore (builds and tests successfully)
* CentOS is not used anywhere in the CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87309
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
This replaces the manual function pointers, making it easier to write
new drop-in allocators.
Note that most allocation goes through the Allocator interface, which CUDAAllocator inherits from, and this arrangement avoids adding an additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
The syntax is invalid for pip. I missed this a while back:
```
Run pip install -r .github/requirements-gha-cache.txt
ERROR: Invalid requirement: 'lintrunner=0.9.2' (from line 11 of .github/requirements-gha-cache.txt)
Hint: = is not a valid operator. Did you mean == ?
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87295
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/functorch/issues/1052
I got here after some discussion with Alban. Today, if you aot_function() trace a program where some of its inputs have `requires_grad=True`, but some outputs are expected to have `requires_grad=False`, we will incorrectly set all outputs to have `requires_grad=True`.
A simple solution is to use autograd.function's API for marking outputs as non-differentiable, based on what we witnessed when we traced the forward.
This will make the `autograd.Function` that we return **wrong**, if you created it using inputs that required grad, and tried to re-use it with inputs that have different `requires_grad` field. But as long as we're hiding behind dynamo, which should guard on requires_grad, then we'll re-run `aot_function()` and get out a new compiled function that does the right thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86838
Approved by: https://github.com/ezyang
We initially left it there for BC concerns.
- It has been more than a month since then,
- I have migrated folks who used the previous install command (pip
install ...pytorch.git@subdir=functorch) off of it
so it's time to get rid of it
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87235
Approved by: https://github.com/Chillee
This API adds some improvements to external backends who are building C++ backends out of tree using the `PrivateUse1` dispatch key.
The docs and linked examples go over the API in more detail, but you should be able to use it like:
```
# This should probably be in the __init__.py file of a external backend's python package
> torch.register_privateuse1_backend("foo")
# And it will allow the user to do this:
> a = torch.ones(2, device="foo")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86992
Approved by: https://github.com/albanD
The context is that historically, XLA/LTC tensors haven't had accurate stride information, and functionalization would run "reference" meta kernels for view ops on the side to properly compute strides.
This is more complicated in symint tracing world - we have a `FunctionalTensorWrapper()` that wraps the underlying tensor and has its own set of sizes/strides metadata, but we never create proxy objects for the sizes/strides of the wrapper.
In symint tracing world with aot autograd, we're guaranteed that our underlying strides are accurate anyway, since aot autograd uses fake tensors to perform tracing. We encountered a few bugs with symint's from the `FunctionalTensorWrapper` making their way into `__torch_dispatch__`. To side-step that area of bugs completely (and marginally improve perf), this PR disables the meta tensor tracing for non XLA/LTC use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87108
Approved by: https://github.com/ezyang, https://github.com/wconstab
This reverts commit bbd7b38d5580c44ffb4404d431e07bc2316e59d5.
Reland https://github.com/pytorch/pytorch/pull/86915 with a fix for python arg parser handing for SymInt and SymIntList.
This was uncovered because we are calling directly into python bindings code through test_autocast.py (`torch._C._nn.nll_loss`) without providing a value for the optional symint arg (`ignore_index`). The arg parser constructs the SymInt and SymIntList using the recorded "default_int" or "default_int_list" (schema string parsing) in case a value is not received for an optional argument. Since we weren't handling the symint case properly, the default_int just had a garbage value which was later being used to construct SymInt.
Follow up issue for other unhandled parameter types: https://github.com/pytorch/pytorch/issues/87283
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87095
Approved by: https://github.com/ezyang, https://github.com/albanD
Also, add `torchtriton` and `jinja2` as extra `dynamo` dependencies to PyTorch wheels.
Version the packages as the first 10 characters of the pinned repo hash and make the `torch[dynamo]` wheel depend on the exact version it was built against.
TODO: Automate uploading to nightly wheels storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87234
Approved by: https://github.com/msaroufim
Summary:
The current behavior of the owning_module setter is difficult to understand: it changes the owning_module to None if owners is not 0, yet still increments the owners count. If the owning_module is None, the owners count should be 0, as none of the owners is accessible. On the other hand, if the owners count can increase, the owning_module should be a collection (e.g. a list).
This diff changes owning_module to be a normal attribute. The semantic is that graph can have **at most one** owning module and can be assigned to new module.
The alternative is to use a list to represent the owning_modules of a graph but it breaks backward compatibility and the exact use cases of having multiple owning_modules are not clear.
Test Plan: Test with CI.
Differential Revision: D40200624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86822
Approved by: https://github.com/tugsbayasgalan
Fixes#87010.
It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to that whenever possible.
Benchmarking results show that, on MPS, we would be going from the following code taking **29.89ms instead of the current 1466ms, almost a 50x speedup**.
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k).max().item()
```
And a regular einsum will now take **.506ms instead of 2.76ms.**
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k)
```
Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
We promise that if path is not defined, we would go left to right. The previous code did not keep that promise as we push'd combined ops to the back of the list. For most use cases this is fine (einsum with 3 or fewer inputs), but we should do what we say.
Test plan:
Added a print statement to print the sizes of ops we're contracting to see if the order is fixed. Code run:
```
import torch
a = torch.rand(1)
b = torch.rand(2)
c = torch.rand(3)
d = torch.rand(4)
torch.einsum('a,b,c,d->abcd', a,b,c,d)
```
BEFORE--it does a+b, then c+d, then a+b+c+d, which...is right, but it's not the order specified by the user.
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
```
WITH THIS CHANGE--it actually goes left to right: a+b, a+b+c, a+b+c+d
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87199
Approved by: https://github.com/soulitzer
Summary: For Python-tracing-enabled trace files, the field "python thread": 0 is repeated for every python_function event. This bloats the trace JSON size for a large number of events or deep call stacks. Instead, make this metadata guarded by the verbose flag.
Test Plan: CI
Reviewed By: robieta, slgong-fb
Differential Revision: D40325815
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87096
Approved by: https://github.com/slgong-fb, https://github.com/robieta
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)
`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants
of some functions. You can request dispatch to those for
float64 dtypes when using `register_pointwise` by setting
`use_libdevice_for_f64=True`.
Other minor changes:
- In triton, sigmoid now codegens tl.sigmoid
- silu now comes from decomp, not lowering
- Some test skips no longer necessary, removed or made xfails
Switching to `tl.sigmoid` has exactly same performance.
Moving `silu` to decomp does not change anything, same triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel
At the moment, they were cast to `int64`, which breaks quite a few casting rules, for example in `ops.aten`.
Quite a vintage bug, circa 2020.
With this fix, the following code prints `torch.bool`, rather than `torch.int64`.
```python
import torch
msk = torch.tensor([False])
b = torch.tensor([False])
print(torch.ops.aten.where.ScalarSelf(msk, True, b).dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87179
Approved by: https://github.com/albanD
1. Made TreeSpec into a dataclass.
2. In `__repr__`, recursively transformed TreeSpec into dictionaries and then pretty-printed it.
Fixes#46538. Hi, @ezyang. this PR is for the TreeSpec `__repr__` refactor we discussed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86546
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/torchdynamo/issues/1690
This fixes the error seen in the minifiers. But does not repro the original issue that prompted the above issue.
Fx minifiers work at the level of Fx-graphs, and the original issue lies outside of the Fx graph and is only visible on the second iteration. Therefore, the original issue escapes the abstraction of our existing Fx-based minifiers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87062
Approved by: https://github.com/eellison
Without this, I was running into obvious `KeyError`s because the code assumed that the device was an integer when running `examples/imagenet`.
```python
(pytorch) soumith@bluebox:~/code/examples/imagenet$ python main.py --gpu 0 /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
/home/soumith/code/examples/imagenet/main.py:100: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model 'resnet18'
make_fallback(aten.unfold): a decomposition exists, we should switch to it
make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Traceback (most recent call last):
File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 254, in call_function
return lowerings[target](*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
return decomp_fn(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2994, in var_
diffs = square(sub(x, mean(x, axis, keepdim=True)))
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
return decomp_fn(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2983, in mean
sum_result = sum_(x, axis, keepdim)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
return decomp_fn(*args, **kwargs)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 3211, in sum_
return fn(x, axis, keepdims, dtype=dtype)
File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2953, in inner
result = Reduction.create(
File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 714, in create
hint, split = cls.num_splits(
File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 454, in num_splits
num_sm = get_device_properties(device).multi_processor_count
File "/home/soumith/code/pytorch/torch/_inductor/cuda_properties.py", line 43, in get_device_properties
return _properties()[_device(device)]
KeyError: device(type='cuda', index=0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87174
Approved by: https://github.com/yf225
Real dtype input to `torch.istft` has been deprecated since PyTorch 1.8, so it is well past its due date to be removed.
BC-breaking message:
`torch.istft` no longer supports input in the form of real tensors
with shape `(..., 2)` to mimic complex tensors. Instead, convert
inputs to a complex tensor first before calling `torch.istft`.
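A brief sketch of the migration path for code that still holds the old real `(..., 2)` representation (sizes are illustrative):
```python
import torch

x = torch.randn(2048)
spec = torch.stft(x, n_fft=64, return_complex=True)

spec_real = torch.view_as_real(spec)  # the old-style real (..., 2) layout
y = torch.istft(torch.view_as_complex(spec_real.contiguous()), n_fft=64)
```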
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86628
Approved by: https://github.com/mruberry
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.
Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down.
This just moves the needed properties from the main trainer process to the workers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
Mitigate https://github.com/pytorch/pytorch/issues/87148
### Testing
On AWS (m1, linux)
* Run `conda install blas:openblas`; it should fail with `ChecksumMismatchError`:
```
ChecksumMismatchError: Conda detected a mismatch between the expected content and downloaded content
for url 'https://repo.anaconda.com/pkgs/main/linux-64/blas-1.0-openblas.conda'.
download saved to: /tmp/debug/pkgs/blas-1.0-openblas.conda
expected sha256: c85b5d0a336b5be0f415c71fd7fe2eca59e09f42221bfa684aafef5510ba5487
actual sha256: 5dc5483db0d9785b19e021cee418a8ee03e0ff0e5ebd0b75af4927746604e187
```
* Running `conda install -c conda-forge blas:openblas` works
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87150
Approved by: https://github.com/kit1980
Previously a check would only apply DDP optimizer on frames named "forward".
But on hf_T5_large, a graph break causes some frames like:
```
<graph break in _shift_right>
<graph break in forward>
```
So instead, apply DDP optimizer on all frames.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87097
Approved by: https://github.com/wconstab
Right now, example_value is doing two jobs:
- We use it to propagate metadata (e.g. return type, shapes, etc.)
throughout the graph
- We use it to satisfy queries for the actual value (e.g. torch.cond,
`assume_constant_result`)
This is further complicated by the fact that we have two modes, one
where `example_value` is a fake tensor, and one where it is a real
tensor (this is the `fake_tensor_propagation` config flag).
This leads to scenarios where we don't support every combination of
job + mode,
e.g. if `fake_tensor_propagation=False`, `assume_constant_result` is
broken.
This is made worse by the fact that "fake tensor mode" is the default
and is required if you want dynamic shapes to work.
So, this PR introduces a `get_real_value` API that just runs the graph
up to `node` in order to get a concrete value. This API is orthogonal
to
`example_value`, so it doesn't care about `fake_tensor_propagation`.
When `fake_tensor_propagation=True`: `example_value` is a fake tensor,
you must use the `get_real_value` API to get a concrete value. This
will
be the only configuration in the future.
When `fake_tensor_propagation=False`: `example_value` and
`get_real_value` will produce the same value. This is redundant but we
will be removing this config soon.
To support this, I introduce a cache for computed real values, to
memoize the work involved if we're asking for real values a lot.
I attached this state to `OutputGraph` because it seems to be what
historically managed `example_value` lifetimes, but idk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87091
Approved by: https://github.com/wconstab
This PR adds workarounds to support AOT Autograd's graphs containing `aten.cudnn_batch_norm` and `aten.cudnn_batch_norm_backward` with `TorchRefsNvfuserCapabilityMode`.
The problem with the decomposition of `aten.cudnn_batch_norm` is that it uses a `new_empty` call that is not supported by nvFuser and we are conservative with lowering functions to nvprims by default.
The problem with the decomposition of `aten.cudnn_batch_norm_backward` is described here https://github.com/pytorch/pytorch/pull/86115#issue-1394883782, but changing the decomposition directly in that PR makes many tests fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86796
Approved by: https://github.com/mruberry
Using runner label like `linux.12xlarge` results in linter failure from actionlint, i.e. https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952
```
Error (ACTIONLINT) [runner-label]
label "linux.12xlarge" is unknown. available labels are "windows-
latest", "windows-2022", "windows-2019", "windows-2016", "ubuntu-
latest", "ubuntu-22.04", "ubuntu-20.04", "ubuntu-[18](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:19).04", "macos-latest",
"macos-12", "macos-12.0", "macos-11", "macos-11.0", "macos-10.15",
"self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows",
"linux.[20](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:21)_04.4x", "linux.20_04.16x", "linux.large", "linux.2xlarge",
"linux.4xlarge", "linux.4xlarge.nvidia.gpu", "linux.8xlarge.nvidia.gpu",
"linux.16xlarge.nvidia.gpu", "windows.4xlarge",
"windows.8xlarge.nvidia.gpu", "bm-runner", "linux.rocm.gpu", "macos-m1-
12", "macos-12-xl", "macos-12", "macos12.3-m1". if it is a custom label
for self-hosted runner, set list of labels in actionlint.yaml config file
47 | # an OOM issue when running the job, so this upgrades the runner from 4xlarge
48 | # to the next available tier of 12xlarge. So much memory just to generate cpp
49 | # doc
>>> 50 | runner: linux.12xlarge
51 | # Nightly cpp docs take about 150m to finish, and the number is stable
52 | timeout-minutes: 180
53 | - docs_type: python
```
`linux.12xlarge` is a valid runner label from https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml. This also adds `linux.24xlarge` and `linux.g5.4xlarge.nvidia.gpu`, which are also not added yet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87009
Approved by: https://github.com/ZainRizvi
Fixes#83936, #83907
In #83936, I noticed that after I wrote cross, it's silently incorrect because I misunderstood what the fix to linalg was going to be. This fixes functorch to not be silently incorrect with `linalg.cross`. Since it's a silent correctness issue that I missed, I'm hoping to cherry pick it too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86926
Approved by: https://github.com/zou3519
https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script, it might have to do with the benchmark script also using subprocess.
Before this PR:
```
$ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16
...
Traceback (most recent call last):
File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile
kernel = TritonCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load
mod = PyCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in <module>
from torch._inductor.triton_ops.autotune import reduction
File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in <module>
if has_triton():
File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton
return triton is not None and torch.cuda.get_device_capability() >= (7, 0)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability
prop = get_device_properties(device)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
cc @zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048
Approved by: https://github.com/soumith
This behavior has been deprecated since PyTorch 1.8 but this step of
the deprecation cycle was put on hold in #50102 waiting for JIT
upgraders functionality which doesn't seem to have panned out. I'd say
there has been more than enough of a deprecation period, so we should
just continue.
BC-breaking message:
`torch.stft` takes an optional `return_complex` parameter that
indicates whether the output should be a floating point tensor or a
complex tensor. `return_complex` previously defaulted to `False` for
real input tensors. This PR removes the default and makes
`return_complex` a required argument for real inputs. However, complex
inputs will continue to default to `return_complex=True`.
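A short example of the new requirement for real inputs (values are illustrative):
```python
import torch

x = torch.randn(1024)
spec = torch.stft(x, n_fft=256, return_complex=True)  # now required for real input
print(spec.dtype)  # torch.complex64
```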
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86724
Approved by: https://github.com/mruberry, https://github.com/albanD
This patch significantly improves the parallel compilation performance for compiling triton kernels
by using ProcessPoolExecutor to create persistent pool of compilation workers.
Previously os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces
the worker threads with a pool of processes to do the raw compilation, and does serial work on the main thread
for everything else. This other work couldn't be parallelized anyway since it is mostly in python.
In cold start situations, the time to get the worker threads started can be significant portion of the time.
This patch starts the workers earlier so they are ready to perform compilation (see code comments) when dynamo
gets to that point.
Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.
```
39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded:
1m7.092s - cold, old setup n = 8 (its best config)
```
(cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
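A minimal sketch of the general pattern described above, not the actual inductor code; the worker function, pool size, and kernel sources are placeholders:
```python
import os
from concurrent.futures import ProcessPoolExecutor

def _warm_up(_: int) -> int:
    # trivial job whose only purpose is to force worker processes to start early
    return os.getpid()

def _compile_kernel(source_code: str) -> str:
    # stand-in for the raw per-kernel compile step that runs in a worker process
    return f"compiled {len(source_code)} bytes in pid {os.getpid()}"

if __name__ == "__main__":
    pool = ProcessPoolExecutor(max_workers=8)
    list(pool.map(_warm_up, range(8)))  # start the workers before compilation is requested
    futures = [pool.submit(_compile_kernel, f"# kernel {i}") for i in range(16)]
    results = [f.result() for f in futures]  # remaining (mostly Python) work stays on the main thread
    pool.shutdown()
    print(len(results))
```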
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).
This caused a bug in our generated stride asserts in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride for a size==1 dimension.
This fixes that bug, and moves size/stride assert logic to C++ which should be a small perf gain.
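For illustration, a size-1 dimension's stride never affects indexing, which is why modeling it as 0 is safe but asserting on its original value is not:
```python
import torch

x = torch.randn(4, 1, 8)
print(x.stride())                      # (8, 8, 1) for a contiguous tensor
y = x.as_strided(x.size(), (8, 0, 1))  # same storage, stride 0 on the size-1 dim
print(torch.equal(x, y))               # True: both describe exactly the same elements
```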
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
Fixes the confusing situation mentioned here https://github.com/pytorch/pytorch/issues/85224#issuecomment-1278628262 by
- setting better OG defaults
- changing warnings to errors now that we have better defaults
Test plan:
- Ran einsum tests locally + CI
- Uninstalled opt-einsum and ran through setting
  - `enabled` to False (doesn't throw error)
  - `strategy` to anything that's not None (errors)
  - `strategy` to None (noops)
- Installed opt-einsum and ran through setting
  - `enabled` to False (doesn't throw error)
  - `enabled` to True (doesn't throw error, no ops + defaults to 'auto')
  - `strategy` to random string (errors)
  - `strategy` to None (noops, still is 'auto')
  - `strategy` to 'greedy' (is set to 'greedy')
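A hedged sketch of the settings exercised in the test plan above; the `torch.backends.opt_einsum` attribute names are assumed from the described behavior:
```python
import torch

# assumed interface: torch.backends.opt_einsum.{enabled, strategy}
print(torch.backends.opt_einsum.enabled)       # True when opt-einsum is installed
print(torch.backends.opt_einsum.strategy)      # expected default: 'auto'
torch.backends.opt_einsum.strategy = "greedy"  # an unrecognized string now errors instead of warning
```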
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86985
Approved by: https://github.com/soulitzer
# Support unpacking python dictionary in **torch.jit.trace()**
## Problem statement & Motivation
### Problem 1(usability):
Say, if you have a model and its forward method defined as follows:
**`def forward(self, key1=value1, key2=value2, key3=value3)`**
And you have a dataset and each data point in the dataset is a python dict as follows:
**`data = {key1:value1, key3:value3, key2:value2}`**
The problem is that if you want to trace the model using the dict data from the given dataset, you need to unpack the dictionary, reorder its values manually, and make up a tuple such as **`data_tuple = (value1, value2, value3)`** as the **`example_inputs`** parameter of **`torch.jit.trace()`**. This marshalling process is not user friendly.
### Problem 2 (feasibility):
Say, if you have a model and its forward method defined as follows:
**`def forward(self, key1=None, key2=None, key3=None)`** -> The default value is **None**
And you have a dataset and each data point in the dataset is a python dict as follows:
**`data = {key1:value1, key3:value3}`** -> Only **part of** the values required by forward are given; the rest use the default values.
The problem is that if you want to trace the model using the dict data from the given dataset, it's not feasible at all, because you can pass neither a tuple like **`T1 = (value1, value3)`** nor **`T2 = (value1, None, value3)`**. T1 would mismatch value3 with key2, and T2 includes a **None** value, which is blocked by the tracer's type checking. (Of course you can pass **`T3 = (value1,)`** to make the trace function finish without an exception, but the traced model you get is probably not what you expect, because a different input may produce a different traced result.)
These problems come from the HuggingFace's PT model, especially in text-classification tasks with datasets such as [MRPC,](https://paperswithcode.com/dataset/mrpc) [MNLI](https://paperswithcode.com/dataset/multinli) etc.
## Solution
To address these two issues, we propose to support a new type, namely a python dict, as the example_inputs parameter for torch.jit.trace(). We can use the runtime type information of the example_inputs object to determine whether we fall back to the original tuple path or go into the new dictionary path. Both problem 1 and problem 2 can be solved by utilizing the "**`**`**" operator.
## Limitation & Mitigation
1. If we use a dict as example_inputs to trace the model, then we have to pass a dictionary to the traced model too (because we may change the order of the input parameters' debug names in the TorchScript IR, so we can't assume the traced model's input parameter order is the same as the original model's). We need to highlight this in the documentation to mitigate the problem.
For example:
```
# fetch a data from dataloader, and the data is a dictionary
# and the example_inputs_dict is like: {key1:value1, key3:value3, key2:value2}
# the forward() is like: def forward(self, key1=value1, key2=value2, key3=value3)
example_inputs_dict = next(iter(dataloader))
jit_model = model.eval()
# use the dictionary to trace the model
jit_model = torch.jit.trace(jit_model, example_inputs_dict, strict=False) # Now the IR will be graph(%self : __torch__.module.___torch_mangle_n.Mymodule, %key1 : type1, %key3 : type3, %key2 : type2)
jit_model = torch.jit.freeze(jit_model)
# It's OK to use dict as the parameter for traced model
jit_model(**example_inputs_dict)
example_inputs_tuple = (value1, value3, value2)
# It's wrong to rely on the original args order.
jit_model(*example_inputs_tuple)
```
## Note
1. This PR will make some UTs introduced in [39601](https://github.com/pytorch/pytorch/pull/39601) fail; I think those cases should be classified as unpacking a tuple containing a single dictionary element in our solution.
2. I think there is some ambiguity since currently we only specify passing a tuple or a single Tensor as the example_inputs parameter in **torch.jit.trace()**'s documentation, but it seems we can still pass a dictionary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81623
Approved by: https://github.com/davidberard98
Passing in `offload_to_cpu=True` to checkpoint_wrapper is a bit confusing, because this causes the activation checkpoint args to be ignored and we do CPU offloading. This isn't ideal from API design perspective, so proposing to make `offload_wrapper` its own concept.
Now, offload to CPU + checkpoint can be composed together, such as
```
# apply AC to transformer layers
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer))
# offload the rest of activations to CPU
model = offload_wrapper(model)
```
Will polish / add tests if this proposal sounds good.
Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85459
Approved by: https://github.com/awgu
Summary:
att, with the introduction of QConfigMapping, this name is now very confusing, so renamed
it to something clearer
Test Plan:
python test/test_quantization.py TestQuantizeFx
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86861
Approved by: https://github.com/vkuzo
- `vector<T>({0})` would give you the vector(size, ...) ctor and produce an empty vector of T, along with the scalar-init warning
- `vector<T>({T(0)})` would give you the vector of a single T(0) as you might have intended, and bypasses the warning/error
- the warning can easily be missed but can have serious consequences, so make it an error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86911
Approved by: https://github.com/albanD
Big-bang PR to symintify **all** .sizes() calls in derivatives.yaml, which will be needed for symbolic tracing.
* with the exception of `split()`, which is tougher to land because it requires internal changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86610
Approved by: https://github.com/albanD
This PR shouldn't matter too much, but I figured I'd land it instead of deleting. `PySymInt.min/max` are technically broken today, and this fixes them - but it doesn't matter (yet) because nobody is calling `min()` / `max()` on symints from python (they all happen using `std::min/max` in C++, which desugar to lt / gt calls).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86609
Approved by: https://github.com/albanD
`min_cut_rematerialization_partition` has a default set of hard-coded operations that are allowed to be recomputed in the backward pass.
This PR adds customization ability to this function allowing users to control the behavior by passing `recomputable_ops` instead of relying on the default setting.
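A hedged sketch of threading `recomputable_ops` through AOTAutograd; the import paths, the `aot_function` plumbing, and the chosen op set are assumptions for illustration only:
```python
from functools import partial

import torch
from functorch.compile import aot_function, min_cut_rematerialization_partition  # paths assumed

# example allow-list: only these ops may be recomputed in the backward pass
recomputable_ops = {torch.ops.aten.add, torch.ops.aten.mul}
partition_fn = partial(min_cut_rematerialization_partition, recomputable_ops=recomputable_ops)

def f(x):
    return (x * x + x).sum()

passthrough = lambda gm, example_inputs: gm  # identity "compilers" keep the sketch minimal
compiled = aot_function(f, fw_compiler=passthrough, bw_compiler=passthrough, partition_fn=partition_fn)
compiled(torch.randn(8, requires_grad=True)).backward()
```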
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86686
Approved by: https://github.com/Chillee
The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of profiler, and it is in maintenance mode pending deprecation.
As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortunately, BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`.
One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure.
(Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.)
Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512
Approved by: https://github.com/aaronenyeshi
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT)
They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.
Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
There is a concept in profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However for the most part we don't really interact with them when developing the PyTorch profiler.
Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top level `torch/csrc/profiler` directory.
Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510
Approved by: https://github.com/aaronenyeshi
Move a bunch of globals to instance methods and replace all use to them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.
It almost gets DDP working; the PG is only missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and have the default _World wrap it.
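A hedged sketch of that extension point; the `_World` class and `_world` attribute names come from this description, and the override surface is an assumption:
```python
import torch.distributed.distributed_c10d as c10d

class MyWorld(c10d._World):
    # subclass the default state holder; a real override would customize PG bookkeeping,
    # e.g. by backing the state with a threading.local for per-thread PGs
    pass

c10d._world = MyWorld()  # all PG-related state now routes through the replacement
```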
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
The complex lerp kernel uses `std::abs(z) < 0.5`, which involves
computing a sqrt. Comparing the square against 0.25 instead has much
lower latency and so performs much better overall.
In a simple timeit benchmark I see more than 10x speedup on CPU for a 4096
element complex lerp, from 84 us to 6.7 us.
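An illustrative check in Python (the actual change is in the C++ kernel) showing that the squared-magnitude comparison is equivalent while avoiding the sqrt hidden inside `abs()`:
```python
import torch

z = torch.randn(1000, dtype=torch.complex64)
slow = z.abs() < 0.5                               # involves a sqrt per element
fast = (z.real * z.real + z.imag * z.imag) < 0.25  # same predicate without the sqrt
print((slow != fast).sum().item())                 # 0, barring values exactly at the boundary
```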
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84844
Approved by: https://github.com/ngimel
## BFloat16 dtype support for faster inference with TorchScript using oneDNN Graph
Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support.
oneDNN Graph delivers high inference performance with BFloat16 on such machines.
While oneDNN Graph can still be used with BFloat16 on older machines that lack `avx512_bf16` ISA but support `avx512bw`, `avx512vl` & `avx512dq` ISAs, the BF16 performance on these older machines will be significantly poorer (probably even poorer than Float32), as they lack native BF16 support.
Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956).
So, for using oneDNN Graph with BFloat16, eager-mode AMP should be leveraged by turning off AMP for JIT mode, using `torch._C._jit_set_autocast_mode(False)` in python code, so as to avoid conflicts.
Please use the following environment variable to view JIT logs -
`PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`
## Changes being made in this PR
1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only files pertaining to oneDNN Graph are being updated. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2).
To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed -
| Action-item | Status |
| :--- | ---: |
|checkInputCompatibility follow up | Fixed |
|the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
|fix up fixConvOptionalBias| The current approach seems correct |
|Use opinfo tests| using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
|inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
|not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
|fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version|
4. More PyTorch ops are being mapped to oneDNN Graph
## Example of using oneDNN Graph with BFloat16
```python
# Assuming we have a model of the name 'model'
example_input = torch.rand(1, 3, 224, 224)
# enable oneDNN Graph
torch.jit.enable_onednn_fusion(True)
# Disable AMP for JIT
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
model = torch.jit.trace(model, (example_input))
model = torch.jit.freeze(model)
# 2 warm-ups (2 for tracing/scripting with an example, 3 without an example)
model(example_input)
model(example_input)
# speedup would be observed in subsequent runs.
model(example_input)
```
## TorchBench based Benchmarks
**URL:** https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL).
**Batch-size(s):** TorchBench-default for each model
**Baseline :** PyTorch JIT OFI FP32
**Machine:** Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
**Sockets used**: 1
**Number of cores on one socket**: 26
Intel OpenMP & tcmalloc were preloaded
#### Benchmark results with single thread
| name | latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 1.063851 | 0.509820 | -52.1% |
| test_eval[mnasnet1_0-cpu-jit] | 0.218435 | 0.107100 | -51.0% |
| test_eval[mobilenet_v2-cpu-jit] | 0.114467 | 0.058359 | -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.233873 | 0.117614 | -49.7% |
| test_eval[resnet18-cpu-jit] | 0.160584 | 0.075854 | -52.8% |
| test_eval[resnet50-cpu-jit] | 1.652846 | 0.713373 | -56.8% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.471174 | 0.209431 | -55.6% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit] | 0.161247 | 0.045684 | -71.7% |
| test_eval[timm_efficientnet-cpu-jit] | 1.643772 | 0.800099 | -51.3% |
| test_eval[timm_regnet-cpu-jit] | 5.732272 | 2.333417 | -59.3% |
| test_eval[timm_resnest-cpu-jit] | 1.366464 | 0.715252 | -47.7% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.508521 | 0.271598 | -46.6% |
| test_eval[timm_vovnet-cpu-jit] | 2.756692 | 1.125033 | -59.2% |
| test_eval[vgg16-cpu-jit] | 0.711533 | 0.312344 | -56.1% |
#### Benchmark results with 26 threads:
| name | latency of PyTorch JIT OFI FP32 (s) | Latency of oneDNN Graph BF16 (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[alexnet-cpu-jit] | 0.062871 | 0.034198 | -45.6% |
| test_eval[mnasnet1_0-cpu-jit] | 0.022490 | 0.008172 | -63.7% |
| test_eval[mobilenet_v2-cpu-jit] | 0.012730 | 0.005866 | -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit] | 0.025948 | 0.010346 | -60.1% |
| test_eval[resnet18-cpu-jit] | 0.011194 | 0.005726 | -48.9% |
| test_eval[resnet50-cpu-jit] | 0.124662 | 0.045599 | -63.4% |
| test_eval[resnext50_32x4d-cpu-jit] | 0.034737 | 0.015214 | -56.2% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit] | 0.012557 | 0.003876 | -69.1% |
| test_eval[timm_efficientnet-cpu-jit] | 0.203177 | 0.051879 | -74.5% |
| test_eval[timm_regnet-cpu-jit] | 0.452050 | 0.151113 | -66.6% |
| test_eval[timm_resnest-cpu-jit] | 0.117072 | 0.052848 | -54.9% |
| test_eval[timm_vision_transformer-cpu-jit] | 0.046048 | 0.023275 | -49.5% |
| test_eval[timm_vovnet-cpu-jit] | 0.213187 | 0.077482 | -63.7% |
| test_eval[vgg16-cpu-jit] | 0.044726 | 0.021998 | -50.8% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591
Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w
Added some details about:
- `pip uninstall functorch` being helpful if there are problems
- `pip install functorch` still working for BC reasons.
Test Plan:
- wait for docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86823
Approved by: https://github.com/samdow
This rewrites various sample and error input functions to:
- use the convention of `make_arg = functools.partial(make_tensor, ...)`
- use the new natural syntax for `SampleInput` construction
- yield instead of returning a lists, to reduce memory consumption
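A hedged sketch of the resulting convention; the `SampleInput` import path and the op name are assumptions:
```python
import functools

from torch.testing import make_tensor
from torch.testing._internal.common_methods_invocations import SampleInput  # path assumed

def sample_inputs_myop(op_info, device, dtype, requires_grad, **kwargs):
    make_arg = functools.partial(
        make_tensor, device=device, dtype=dtype, requires_grad=requires_grad)
    # yield samples lazily instead of building and returning a list
    yield SampleInput(make_arg(3, 4))
    yield SampleInput(make_arg(3, 4), args=(make_arg(4, 5),))
```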
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86231
Approved by: https://github.com/mruberry
Fixes https://github.com/pytorch/pytorch/issues/82235
cc @albanD - `at::pixel_shuffle` and `at::pixel_unshuffle` advertise as being non-aliasing, but they have a C++ decomposition that internally uses reshape(), which means that it might return an alias.
I happened to notice this because a bunch of tests in `test/test_ops.py` failed when I ran locally with a `DEBUG=1` build.
(P.S.: when are we finally gonna get a debug build test in CI? 😃)
I fixed by adding an extra clone, which... is going to be an unnecessary perf hit in the case where the `reshape()` already properly cloned the input. My hope is that this is fine, because this only impacts the composite kernel- we already have a "fast" CPU kernel that does the right thing. Is `pixel_shuffle/unshuffle` commonly used with cuda? Maybe we should just add a fast cuda kernel for it if that's the case.
Alternatively, it seems like it would be nice if `reshape()` accepted an optional argument to unconditionally return a copy. That seems like a rabbit hole that isn't worth going down for now though - I remember a discussion a while ago about making `reshape()` copy-on-write
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86608
Approved by: https://github.com/albanD
The Python decomp for `native_group_norm` is correct in more cases than the C++ composite. Updating the tests to fail properly in this case was more annoying than just fixing the C++ decomp, so I fixed it here.
When the input tensor had a dtype with less precision than float32, the C++ decomp would unconditionally set the mean/variance to float32, which was wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86607
Approved by: https://github.com/albanD
This PR adds vmap support for slogdet -- slogdet just decomposes into
linalg.slogdet.
This fixes a regression from functorch 0.2.1 (slogdet had a batching
rule then, and doesn't anymore). We didn't catch the regression because
it seems like slogdet doesn't have an OpInfo (I'm not sure if it had one
before).
Test Plan:
- new one-off test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86815
Approved by: https://github.com/samdow
This PR sets CUDA_MODULE_LOADING if it's not set by the user. By default, it sets it to "LAZY".
It was tested using the following commands:
```
python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"
```
which shows a memory usage of: 287,047,680 bytes
vs
```
CUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"
```
which shows 666,632,192 bytes.
C++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality).
cc: @ptrblck @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85692
Approved by: https://github.com/malfet
TLDR: see D39003528 to see the actual changes in this diff more clearly, which will make reviewing easier
___
The 32bit versions were changed to be created with a macros which are also used to create 16bit and 8bit versions
This diff shows that almost all of the lines in the .s files were modified, but most changes are just adding spaces to the front and ;/ to the end so they can be contained in the macro. To generate these changes, I first wrote the macros without the spaces and ;/, and then I ran a script (see the python file in D39003528) to get the final version.
To review this diff more easily, if you want to see the code changes before I ran the script, which makes it much easier to see which lines were changed, see D39003528.
Each version of this diff is synched with the same number version of that diff (so if I change this diff I will mirror the changes to the same version on that diff)
Differential Revision: [D39003527](https://our.internmc.facebook.com/intern/diff/D39003527/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85245
Approved by: https://github.com/kimishpatel
Summary: this file doesn't actually exist anymore so its just a case of
removing the exception for it
Test Plan: python test/test_public_bindings.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86036
Approved by: https://github.com/jerryzh168
The TTS model will crash due to the following issue: when the input of BN is not contiguous and the input's data type differs from that of the parameters, BN raises `RuntimeError: !needs_dynamic_casting<func_t>::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`.
Make the data types of the output and input consistent for batchnorm to fix the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
Allow `viable/strict` promotion even if `periodic` or `docker-release-builds` jobs are failing
**Why?** Those jobs only run occasionally and for all we know the current viable/strict commit may already include the errors that the above cron based workflows may have later detected. Blocking the viable/strict upgrade because of these scheduled jobs doesn't really offer any value, it just leads to people getting older PRs when they try to fork off of viable/strict without guaranteeing an improvement in test quality
Though frankly, the current situation is worse than that.
Assume the branch history looks like A -> B
A is the current `viable/strict` commit
B is a commit that failed some `periodic` test, so `viable/strict` wasn't upgraded to B
Now let's say there's a commit C that gets merged. C neither contains a fix for the failing periodic build, nor does a scheduled periodic workflow run against it. The branch becomes A -> B -> C
In the above scenario, today we will promote `viable/strict` to C since there was no failing workflow there!!! Even though it didn't actually fix what was broken with B!
In short, avoiding the upgrade to B really doesn't make any sense today and we shouldn't do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86827
Approved by: https://github.com/janeyx99
As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called.
Fixes#85837
cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148
Approved by: https://github.com/albanD
Summary:
This PR adds checks for the existence of "weight_dtype" and "bias_dtype" in the node_name_to_dtype dictionary before accessing it.
The corner case is hit when we check the compatibility of qconfig and backend_config for a weight or bias that appears before the activation (e.g. torch.addmm)
Test Plan:
python test/test_quantization.py -k test_backend_config_check_for_weight_and_bias
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86719
Approved by: https://github.com/andrewor14
Summary: added __all__. One issue with QuantizeHandler is that since it's
defined as 'Any', it can't be set as a public module even though it should
be; I've set it to private here, but when the circular dependency gets
fixed, it will probably be removed.
Test Plan: python test/test_public_bindings.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86033
Approved by: https://github.com/jerryzh168
Summary: the main problem with this was that the different objects
defined simply as 'Any' should theoretically be public but making them
public either A) results in an error about the module being 'typing'
rather than whatever module it should be or B) you set the module
manually, thereby changing the module for the original 'Any' class.
Note: QuantizeHandler has a similar issue where it's simply defined as 'Any'.
Pattern was defined in multiple places, which was causing issues, so I just moved it to a single place given the note at the top of quantization_types.py indicating these definitions should be moved to utils at some point anyway.
Finally, I changed any references to these objects to point at the correct locations. Note: I didn't see any fb internal references to NodePattern or QuantizerCls that would cause issues.
Test Plan: python test/test_public_bindings.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86031
Approved by: https://github.com/jerryzh168
Currently `test_dtypes` swallows all exceptions which can make debugging failures more tricky.
This changes the test to save the exceptions and print only the unexpected ones at the end e.g.
```
AssertionError: The supported dtypes for nn.functional._scaled_dot_product_attention on device type cuda are incorrect!
The following dtypes did not work in backward but are listed by the OpInfo: {torch.bfloat16}.
Unexpected failures raised the following errors:
torch.bfloat16 - CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling [...]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86599
Approved by: https://github.com/mruberry
This PR applies a large hammer and disables TF32 in specific functorch transform tests. TF32 isn't precise enough to test correctness.
We could have applied a smaller hammer by disabling TF32 per-OpInfo, but that doesn't seem to have too much additional benefit (e.g. if a convolution batching rule is correct on fp32 then I would expect it to be correct under TF32 modulo precision issues because the actual sequence of PyTorch operators we invoke has not changed, only the backend did).
Test Plan:
- I tested this locally on a machine with A100 GPUs.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86799
Approved by: https://github.com/malfet
Fixes#83973 (This is a substitute PR for https://github.com/pytorch/pytorch/pull/85024)
First of all, thanks for your invaluable contributions to PyTorch everyone!
Given how extensively `torch.cuda.is_available` is used in the PyTorch ecosystem, IMHO it's worthwhile to provide downstream libraries/frameworks/users the ability to alter the default behavior of `torch.cuda.is_available` in the context of their PyTorch usage.
I'm confident there are many current and future such use cases which could benefit from leveraging a weakened, NVML-based `torch.cuda.is_available` assessment at a downstream framework's explicit direction (thanks @malfet 81da50a972 !). Though one could always patch out the `torch.cuda.is_available` function with another implementation in a downstream library, I think this environmental variable based configuration option is more convenient and the cost to including the option is quite low.
As discussed in https://github.com/pytorch/pytorch/pull/85024#issuecomment-1261542045, this PR gates new non-default NVML-based CUDA behavior with an environmental variable (PYTORCH_NVML_BASED_CUDA_CHK) that allows a user/framework to invoke non-default, NVML-based `is_available()` assessments if desired.
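A hedged usage sketch; the environment variable is set before importing torch, which is the safest ordering:
```python
import os

os.environ["PYTORCH_NVML_BASED_CUDA_CHK"] = "1"  # opt in to the NVML-based assessment

import torch

print(torch.cuda.is_available())  # uses the weaker NVML-based check when the variable is set
```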
Thanks again for your work everyone!
@ngimel @malfet @awaelchli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85951
Approved by: https://github.com/ngimel
This enables testing lots of modern CUDA features on an sm_86-capable GPU.
While migrating to that platform, I discovered that `functorch` tests for `nn.functional.conv.transpose3d` produce garbage on sm_80+, and that 2 `nvfuser` tests unexpectedly pass and one unexpectedly fails.
TODO:
- Investigate unexpected success for `test_vmapvjp_linalg_householder_product_cuda_float32` and add `functorch` shard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85524
Approved by: https://github.com/ngimel
Fix for https://github.com/pytorch/torchdynamo/issues/1368
From comment:
> When we invoke a Composite Implicit autograd operator that has an autocast rule, such as Einsum,
autocast is disabled during its invocation. When we trace out the operators in an implicit op,
re-applying on autocast rules on those operators might yield divergence from what was executed at runtime.
This pass checks for divergence. If divergence is found, we will disable autocast.
We would like to avoid disabling autocast if possible because accessing TLS is slow.
Concretely, the problem found was when invoked `sum` in `einsum`:
As seen by the following divergence:
```
>>> with torch.cuda.amp.autocast(enabled=True):
... print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype)
...
torch.float32
>>> print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype)
torch.float16
```
Edit: we've decided to accept the overhead of universally disabling autocast instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86515
Approved by: https://github.com/bdhirsh, https://github.com/Chillee
Backport currently doesn't work with some models if:
* model is originally exported with interface call enabled (backport would disable it)
* model is flatbuffer (flatbuffer support is soft enabled via link time registry), so we manually trigger it
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86510
Approved by: https://github.com/cccclai
This reverts commit 978b46d7c96627e3b3553ad70ad21cb161d05f90.
Reverted https://github.com/pytorch/pytorch/pull/86488 on behalf of https://github.com/osalpekar due to Broke executorch builds internally with the following message: RuntimeError: Missing out variant for functional op: aten::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[] . Make sure you have loaded your custom_ops_generated_lib
If an invalid platform is specified when disabling a test with flaky test bot, the CI crashes, skipping all tests that come after it.
This turns it into a console message instead. Not erroring out here since it'll affect random PRs. Actual error message should go into the bot that parses the original issue so that it can respond on that issue directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86632
Approved by: https://github.com/huydhn
Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt-in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the github actions workflow for ROCm since all current CI hosts are gfx906.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369
Approved by: https://github.com/huydhn
Summary: D40151818 (82ed5ca340) replaces the `TORCH_CHECK` with a `TORCH_WARN` but since it does not check if the context is valid the message gets printed every time. This diff fixes that.
Test Plan:
Referring to [Pytorch Vulkan Testing Procedures](https://fb.quip.com/fZALAc9zhlcU)
On Mac:
1. `vulkan_api_test` on Mac
2. model comparison binary on Mac
On Android:
1. `vulkan_api_test` on Android
2. benchmark binary on Android
Reviewed By: salilsdesai
Differential Revision: D40266820
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86697
Approved by: https://github.com/kirklandsign
This can be critical when processing a large number of tensors
```bash
python -m timeit --setup 'import torch; t = torch.empty(1000, device="cuda")' 't.__dlpack_device__()'
```
based on 1.12.1:
before:
100000 loops, best of 5: 2.32 usec per loop
after:
500000 loops, best of 5: 844 nsec per loop
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86665
Approved by: https://github.com/SunDoge, https://github.com/soulitzer
It seems like the [torch.fx.Node docs](https://pytorch.org/docs/stable/fx.html#torch.fx.Node) are incorrect regarding the inclusion of the self argument for module call nodes.
While the docs state that self (the module) is included in `args`, it is in fact not, as demonstrated by this code:
```python
import torch
from torch import fx, nn
class Net(nn.Module):
def __init__(self):
super().__init__()
self.submod = nn.Linear(10, 10)
def forward(self, x):
x = x.flatten()
return self.submod(x)
graph_module = fx.symbolic_trace(Net())
print(graph_module.graph) # doesn't show self for the submodule call
submod_node = list(graph_module.graph.nodes)[2]
print(submod_node.op) # call_module
print(submod_node.args) # (flatten,) => would need to have len 2 if self was included
flatten_node = list(graph_module.graph.nodes)[1]
print(flatten_node.op) # call_method
print(flatten_node.args) # (x,) => here self is included (and docs are correct)
```
Since [torch.fx.Interpreter also uses `args` as if self were not included](2fe5808590/torch/fx/interpreter.py (L288)), I assume the docs are incorrect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86685
Approved by: https://github.com/soulitzer
- supports saving symint (and symfloat..) values between fw/bwd, using sketchy logic that probably needs to be improved but seems to work so far
- sets a correct weight=1 for sym nodes for cost purposes
- lets user functions return symints/floats (but if the same symfloat is saved for backward, that gets duplicated annoyingly)
- makes partitioning decisions based on observed trace-time sizes without guarding! (this is sketchy, but it isn't clear that it will lead to bad partitioning choices either)
- improves infra for tracking symint-family of types: is_sym_node() and _py_sym_types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86425
Approved by: https://github.com/ezyang
Summary:
As titled, HistogramObserver may fail in a certain scenario.
Specifically, we originally compute `hist_bin_width` as `(self.max_val - self.min_val) / (self.bins * upsample_rate)`. It's possible that the numerator is close to the FP32 threshold (1.4e-45), and performing the division then causes overflow.
Introduce some redundant computation to avoid this scenario.
Test Plan: https://pxl.cl/2ggD4 (04490e90ea)
Differential Revision: D40149594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86522
Approved by: https://github.com/jerryzh168
Since the default qengine is the last element of the supported_engines list, adding the x86 qengine at the end of the list changed the default quantized engine as well. This PR is a short-term fix to revert that change. We have an issue to track the proper fix: https://github.com/pytorch/pytorch/issues/86404
Motivation:
A Meta internal team found that inference failed in onednn prepacking with the error "could not create a primitive descriptor for a reorder primitive." on a COPPER_LAKE machine; we are working with Intel to repro and fix the problem. In the meantime, we'll revert the default option back to fbgemm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86631
Approved by: https://github.com/vkuzo
Fixes#86159 and #86108
Refactored graph partition to check for cyclic dependency on each partition merge, instead of relying on a pre-baked dependency map.
The previous implementation suffered from not updating dependencies on existing partitions. When a fusion happens, the updated dependency map needs to be propagated to all nodes in the graph, so that each node in a partition shares an identical dependency set; the previous implementation therefore failed to identify the cyclic dependency in issue #86159.
Updated implementation does a cyclic check on partitioned graph before attempting a merge of two partitions.
- [x] python repro added with cyclic dependency after partition `TestFXGraphPasses.forward12`
- [x] fix dependency map with updated implementation using cyclic check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86511
Approved by: https://github.com/SherlockNoMad
Before this PR, if a user runs DDP with `device_ids` specified and with a `PackedSequence` input, then the execution will error with something like:
```
raise ValueError(
ValueError: batch_sizes should always be on CPU. Instances of PackedSequence should never be created manually. They should be instantiated by
functions like pack_sequence and pack_padded_sequences in nn.utils.rnn. https://pytorch.org/docs/stable/nn.html...
```
This is because the DDP forward calls `_to_kwargs()`, which calls `_recursive_to()`, which moves the inputs to GPU. However, `_is_namedtuple(packed_sequence)` returns `True`, leading to the branch `return [type(obj)(*args) for args in zip(*map(to_map, obj))]`, which tries to construct a `PackedSequence` directly via `type(obj)(*args)`, leading to the error.
Repro for `_is_namedtuple(packed_sequence)` returning `True`:
```
import random
import torch
import torch.nn.utils.rnn as rnn_utils
from torch.nn.parallel.scatter_gather import _is_namedtuple
def _ordered_sequence(tensor_type):
seqs = [tensor_type(random.randint(1, 256))
for _ in range(32)]
seqs = [s.random_(-128, 128) for s in seqs]
ordered = sorted(seqs, key=len, reverse=True)
return ordered
def _padded_sequence(tensor_type):
ordered = _ordered_sequence(tensor_type)
lengths = [len(i) for i in ordered]
padded_tensor = rnn_utils.pad_sequence(ordered)
return padded_tensor, lengths
padded, lengths = _padded_sequence(torch.Tensor)
packed = rnn_utils.pack_padded_sequence(
padded, lengths, enforce_sorted=False)
print(type(packed), packed.data.device)
print(_is_namedtuple(packed))
```
Test Plan:
```
python test/distributed/test_c10d_nccl.py -k test_ddp_packed_sequence
```
Without the fix, the added unit test fails with the expected error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86614
Approved by: https://github.com/rohan-varma
Summary:
This header is being included from both aten/native and torch/csrc, but
some of our build configurations don't allow direct dependencies from
torch/csrc to atent/native, so put the header in aten where it's always
accessible.
Resolves https://github.com/pytorch/pytorch/issues/81198
Test Plan:
CI.
```
./scripts/build_android.sh
env ANDROID_ABI="x86_64" ANDROID_NDK=".../ndk-bundle" CMAKE_CXX_COMPILER_LAUNCHER=ccache CMAKE_C_COMPILER_LAUNCHER=ccache USE_VULKAN=0 ./scripts/build_android.sh
echo '#include <torch/torch.h>' > test.cpp
g++ -E -I $PWD/build_android/install/include/ -I $PWD/build_android/install/include/torch/csrc/api/include test.cpp >/dev/null
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82379
Approved by: https://github.com/ezyang, https://github.com/malfet
Summary:
previously the call failed because there was an infinite loop in _get_share_qparams_ops_configs
Test Plan:
python test/test_quantization.py -k test_get_executorch_backend_config
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86338
Approved by: https://github.com/andrewor14
# Summary
Many NestedTensor ops are implemented using a convenience function named get_buffer. This returns a dense, contiguous tensor that is a view of the underlying storage of the NestedTensor. This function allows NestedTensor ops to piggyback off of the dense-tensor implementations in certain scenarios. This PR adds a TORCH_CHECK() to get_buffer to ensure that the calling NT is in fact contiguous. It also adds an "unsafe" version for a few ops that are designed to handle contiguity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86496
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
In the documentation of `nn.MaxPool2d` and `nn.MaxPool3d`, the argument description of `padding` incorrectly states that zero padding is applied. The remainder of the documentation correctly states that negative infinity padding is applied.
The documentation of `padding` in `nn.MaxPool1d`, `nn.functional.max_pool1d/2d/3d` is correct.
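A small demonstration of the behavior the docs should describe, i.e. implicit negative-infinity padding:
```python
import torch
import torch.nn as nn

x = torch.full((1, 1, 2, 2), -5.0)
out = nn.MaxPool2d(kernel_size=2, stride=2, padding=1)(x)
print(out)  # every entry is -5.0, not 0.0, so the implicit padding acts as -inf rather than zero
```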
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86559
Approved by: https://github.com/albanD
symintify split_with_sizes, dropout, fused_fake_obs_quant. meta for padding_2d ops
add meta_bernoulli_
meta kernel for at::gather
get pytorch_struct to pass: meta for scatter_add, fix backward
symintify split ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86488
Approved by: https://github.com/ezyang
This deprecates `FlattenParamsWrapper`'s usage for "unflattening" the original parameters. After this PR, FPW only serves to register and de-register its `FlatParameter` for the parent `FullyShardedDataParallel` instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86117
Approved by: https://github.com/zhaojuanmao
This PR renames `param_dtype` and `reduce_dtype` in `HandleConfig` to `low_prec_param_dtype` and `low_prec_reduce_dtype` to emphasize that they are meant to be of the low precision (if not `None`).
(In my mind, mixed precision refers to the paradigm of using both full and low precision together during training. "Reduced" and "low precision" mean the same thing, but I prefer the term "low precision" in the code since it is shorter. A particular dtype can be a low precision dtype or a full precision dtype.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86512
Approved by: https://github.com/zhaojuanmao
`gemm_transab_` accumulates the sum in the output, despite the inner
loop being over a single output element. This changes it to accumulate
in a register, which also avoids early truncation for bfloat16.
I've also factored out a generic `sum` function that can be shared
with `gemm_transa_` to handle unrolling and multiple accumulators.
I have benchmarked addmm for bfloat16 with shapes
(320,600) X (600,320) and for both layouts I see a significant
speedup.
| layout | Before (ms) | After (ms) |
|----------|-------------|------------|
| transa | 71.5 | 31 |
| transab | 249 | 35 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80977
Approved by: https://github.com/ngimel
Summary: using the tool from D39559248, I was able to make g2p faster on mobile by taking a look at profiles on stella frames. It turned out that the pytorch interpreter code does some logging that ends up being a pretty big bottleneck.
Differential Revision: D39901455
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85907
Approved by: https://github.com/dzdang
This PR allows freezing modules like the one below:
```python
# Ex. 1
@torch.jit.interface
class ModuleInterface(torch.nn.Module):
def forward(self, inp: torch.Tensor) -> torch.Tensor:
pass
class ImplementsInterface(torch.nn.Module):
def __init__(self):
super(ImplementsInterface, self).__init__()
self.sum = torch.zeros((2, 2))
def forward(self, inp: torch.Tensor) -> torch.Tensor:
self.sum += inp.relu() # this makes the interface-implementing module mutable
# and previously this would prevent freezing
return self.sum
class WrapperModule(torch.nn.Module):
impl: ModuleInterface
def __init__(self):
super().__init__()
self.impl = ImplementsInterface()
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.impl.forward(x)
```
Previously during freezing, we handle interfaces as shown below:
1. we inline interfaces in any preserved method graphs
2. during `cleanupFrozenModule`, we try to simplify the module data structure (<- this part is unrelated to freezing so far). During this step, if we found that a interface type was mutable, we'd error out; because of the possibility of a module that _swaps out the value of an interface-typed attribute at runtime_.
Below is an example of a module that swaps out the value of an interface-typed attribute at runtime:
```python
# Ex. 2
class MyBadModule(torch.nn.Module):
impl: MyInterface
option1: IfaceImpl1
option2: IfaceImpl2
....
def forward(self, x):
if x > 0:
self.impl = self.option1
else:
self.impl = self.option2
....
```
^ this type of situation cannot be supported by freezing (or at least would be difficult to do correctly) because it greatly complicates the details of handling types and simplifying the module data structure.
But we can still support the first example without _too_ much work:
1. inline the interface code as before
2. check to see if we have any setattrs on interface types; if so, error out
3. otherwise, replace the type of the interface types with the concrete type implementation
4. continue simplifying the module data structure as if we never had any interfaces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86039
Approved by: https://github.com/eellison
Summary:
A user had a problem with fx-scripting and the error message can be improved.
Error was shown as:
RuntimeError: Keys for dictionaries used as an argument cannot contain a Node. Got key: {k}
which is obviously not very helpful.
Test Plan:
Test in a notebook:
{F778667593}
Reviewed By: xunnanxu, SherlockNoMad
Differential Revision: D40157518
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86422
Approved by: https://github.com/SherlockNoMad
We can currently take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.
We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.
As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).
This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
For decomposing index_select with 0-dim tensor, we cannot write `x.unsqueeze(0)[index].squeeze(0).clone()` , as tensor[index] will trigger index.item() if index is a 0-dim tensor, and .item() cannot be symbolically traced with FakeTensor.
We use `torch.ops.aten.index(x.unsqueeze(0), [index]).squeeze(0).clone()` as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86469
Approved by: https://github.com/ngimel
This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well.
Adding this is helpful for debugging `use_orig_params=True` to make sure gradients are being updated correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85738
Approved by: https://github.com/rohan-varma
**Overview**
This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor.
- This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups.
- This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy.
For more detailed design explanation, refer to the Quip shared internally.
**Follow-Ups**
See 85831 (removing link to avoid spamming the issue whenever I update this PR).
`test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84911
Approved by: https://github.com/rohan-varma
This PR cleans up m.impl(...) calls to use the new KERNEL / KERNEL_CPU
macros. That saves us the trouble of writing out the signatures.
Test Plan:
- code reading
- wait for tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86403
Approved by: https://github.com/ezyang
On the way to resolving https://github.com/pytorch/pytorch/issues/86294
Previously, there were three macros used to register autocast rules:
- KERNEL
- KERNEL_DIFFERENT_REDISPATCH_SIGNATURE
- KERNEL_CPU
This PR makes the KERNEL and KERNEL_CPU macros less redundant for users.
KERNEL_DIFFERENT_REDISPATCH_SIGNATURE is weird and only used three
times, so I didn't change them.
Concretely, KERNEL(OP, OP_NAME, SIGNATURE, POLICY) is redundant:
- op/op_name are similar, and the signature can be decltype'd.
PR changes it so that instead, one uses either:
- KERNEL(OP, POLICY)
- KERNEL2(OP, OVERLOAD, POLICY)
depending on whether the operator name has an overload.
This PR also gives the same treatment to the KERNEL_CPU macro, which is
used for registering autocast cpu rules: it splits KERNEL_CPU into
KERNEL_CPU(OP, POLICY) AND KERNEL_CPU2(OP, OVERLOAD, POLICY).
I will do some more cleanup of things that are implemented via
`m.impl(...)` in a follow-up PR so that I don't get confused when I need
to rebase.
Test Plan:
- wait for tests (how good are our autocast tests?)
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86402
Approved by: https://github.com/ezyang
Summary:
- catch .grad tensor info
- update data type and `check_and_store`, etc
- update unit test case
Test Plan: buck run mode/opt //caffe2/test:profiler
Reviewed By: chaekit
Differential Revision: D39711295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355
Approved by: https://github.com/chaekit
This bug was in the native cuDNN V8 API integration and was fixed a while ago, but the change was never ported here.
Previously the returned alignment could be twice the actual alignment of the data if the alignment was smaller than 16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86253
Approved by: https://github.com/dzdang
This PR introduces an interface for user defined function that filters the matches in SubgraphRewriter. The function will have the following signature.
callable(match: InternalMatch, original_graph: Graph, pattern_graph: Graph) -> bool
This filter is applied after SubgraphMatcher returns the matches, and before replacement takes place.
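A hedged sketch of such a filter; the `nodes_map` attribute on the match object is an assumption here, and how the filter is registered with the rewriter is left out:
```python
import torch
from torch.fx import Graph

def keep_only_relu_matches(match, original_graph: Graph, pattern_graph: Graph) -> bool:
    # accept a candidate match only if it contains a call to torch.relu
    return any(
        getattr(node, "op", None) == "call_function" and node.target is torch.relu
        for node in match.nodes_map.values()
    )
```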
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86430
Approved by: https://github.com/jerryzh168
`Sparsity` as a term doesn't reflect the tools that are developed by the AO. The `torch/ao/sparsity` also has utilities for structured pruning, which internally we always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing the backwards compatibility, as so far this toolset was kept under silent development.
This change will reflect the changes in the documentation as well.
**TODO:**
- [ ] Change the tutorials
- [ ] Confirm no bc-breakages
- [ ] Reflect the changes in the trackers and RFC docs
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867
Approved by: https://github.com/supriyar
Summary:
The control-flow logic around torch.distributed imports results in a large number of pyre errors (hundreds); it would be preferable to raise on import rather than fail silently.
Con: some percentage of users (macOS?) may have notebooks that import PT-D, although we expect this to be small, since any attempt to call parts of the library would just fail...
TODO: assuming this is OK, remove the tens to hundreds of now-unneeded pyre ignores.
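For notebooks that might run on builds without distributed support, the usual availability guard still applies; a minimal sketch using the public `is_available()` check:
```
import torch.distributed as dist

if dist.is_available():
    # Safe to use the rest of the PT-D API here, e.g. dist.init_process_group(...)
    print("distributed backend available")
else:
    # e.g. builds without distributed support: fall back to single-process behavior
    print("running without torch.distributed")
```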
Test Plan: existing unit tests
Differential Revision: D39842273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
- symintify split_with_sizes, dropout, fused_fake_obs_quant; meta for padding_2d ops
- add meta_bernoulli_
- meta kernel for at::gather
- get pytorch_struct to pass: meta for scatter_add, fix backward
- symintify split ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86334
Approved by: https://github.com/ezyang
Summary: no changes, just removed the exception for this file, someone
had already fixed the actual file
Test Plan: python test/test_public_bindings.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86026
Approved by: https://github.com/jerryzh168
Summary:
Looks like Sandcastle CI didn't cover any concrete mobile CI (cc: kimishpatel, I'd assume we have a ton of mobile tests on GitHub?). This is failing on Oculus with a failure similar to the one on Mac (not sure if this is an ARM thing). Either way, on-demand tracing should not be enabled on these platforms, so disable it completely.
In the future, we should have a runtime check for even safer guarding.
Test Plan:
Set up Hollywood via P536072492
## Before
crash on mutex. likely SIOF
```
FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x5d7e298b08)
*** Aborted at 1665017107 (Unix time, try 'date -d 1665017107') ***
*** Signal 6 (SIGABRT) (0xeca) received by PID 3786 (pthread TID 0x785bd1eed0) (linux TID 3786) (maybe from PID 3786, UID 0) (code: -1), stack trace: ***
(error retrieving stack trace)
```
## After
Redacted in the top but the test passes without the crash
P536101962
Differential Revision: D40129840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86347
Approved by: https://github.com/aaronenyeshi
Summary: add `torch.qint32` to `activation_is_statically_quantized` and `get_quant_type` so that fake quantize with `dtype=torch.qint32` won't be skipped
Test Plan: updated `test_custom_module_class`
Differential Revision: D40128178
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86345
Approved by: https://github.com/jerryzh168
The reason for enabling sparse/dense_dim() for strided tensors is to have more meaningful error messages:
For instance, compare
```
NotImplementedError: Could not run 'aten::sparse_dim' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::sparse_dim' is only available for these backends: [SparseCPU, SparseCUDA, SparseMeta, SparseCsrCPU, SparseCsrCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
```
[master] vs
```
RuntimeError: addmm: matrices expected, got 0D tensor
```
[this PR], where the latter message hints at which function is actually to blame for the unexpected input.
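As a rough sketch of my reading of the new behavior (not a quote of the tests): a strided tensor now answers these queries directly, reporting zero sparse dimensions, so callers get far enough to fail with an operator-specific message instead of a dispatcher error.
```
import torch

x = torch.randn(2, 3)  # strided (dense) tensor
print(x.sparse_dim())  # expected: 0 after this change
print(x.dense_dim())   # expected: 2, i.e. all dimensions are dense
```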
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86203
Approved by: https://github.com/cpuhrsch
This does 2 things:
* Ensure that `nvidia-driver-latest-dkms` package is removed if it's installed. This allows the installation to go forward without the below error when using the standard installation script from S3:
```
(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
```
* Do not skip the installation if a driver other than `515.57` exists, to avoid unexpected behavior when using a different driver version. This partly addresses the recent issue in https://github.com/pytorch/pytorch/issues/85778, in which `510.60.02` was installed instead (not sure from where) and failed the CUDA 11.7 test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86344
Approved by: https://github.com/atalman, https://github.com/malfet
Summary: Weight dtypes should be specified only for weighted
ops like conv and linear. This commit removes weight dtypes
from the DTypeConfigs used in binary ops and fixed qparams ops.
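As a hedged illustration (dtype values chosen purely for the example, assuming the `DTypeConfig` API from `torch.ao.quantization.backend_config`): a weighted op's config carries a weight dtype, while a binary op's config now omits it.
```
import torch
from torch.ao.quantization.backend_config import DTypeConfig

# Weighted op (e.g. linear/conv): weight dtype is meaningful.
weighted_op_int8_dtype_config = DTypeConfig(
    input_dtype=torch.quint8,
    output_dtype=torch.quint8,
    weight_dtype=torch.qint8,
    bias_dtype=torch.float,
)

# Binary op (e.g. add/mul): there are no weights, so no weight dtype.
binary_op_int8_dtype_config = DTypeConfig(
    input_dtype=torch.quint8,
    output_dtype=torch.quint8,
)
```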
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
Reviewers: jerryzh168, vkuzo
Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86335
Approved by: https://github.com/vkuzo
Summary: In order to make the layer normalization implementation for nested tensors public, it needs to be generalized to accept a normalized_shape argument instead of assuming it to be the last dimension of the nested_tensor. This commit does that, as well as adding extra unit tests to ensure the implementation is correct.
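A minimal usage sketch of the generalized signature (shapes and the `torch.nested.nested_tensor` constructor are illustrative assumptions):
```
import torch
import torch.nn.functional as F

# Two "sequences" of different lengths, each with 5 features.
nt = torch.nested.nested_tensor([torch.randn(2, 5), torch.randn(3, 5)])

# normalized_shape is now passed explicitly instead of being assumed to be
# the last dimension of the nested tensor.
out = F.layer_norm(nt, normalized_shape=(5,), weight=torch.ones(5), bias=torch.zeros(5))
```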
Test Plan:
All unit tests designed to test different ways of using the function work:
`buck test //caffe2/test:nested -- test_layer_norm`
Differential Revision: D40105207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86295
Approved by: https://github.com/drisspg
The TestApp benchmark was recently re-added; however, it only builds when PyTorch is built with the lite interpreter. This diff adds a macro to compile out the benchmark when PyTorch is built as full JIT. This should fix our full-JIT simulator nightly builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86314
Approved by: https://github.com/malfet
Summary:
- Added config option to remove 'Call stack' field from trace file (#84982)
- Change default value to `false`
Test Plan:
- Passing `experimental_config=_ExperimentalConfig(verbose=True)` adds the 'Call stack' field back to the trace file (see the sketch after this list).
- CI tests
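A hedged sketch of how one might opt back in (the `torch._C._profiler` import location is an assumption on my part):
```
import torch
from torch.profiler import profile, ProfilerActivity
from torch._C._profiler import _ExperimentalConfig  # private API, location assumed

with profile(
    activities=[ProfilerActivity.CPU],
    with_stack=True,
    experimental_config=_ExperimentalConfig(verbose=True),
) as prof:
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))

prof.export_chrome_trace("trace_with_call_stack.json")
```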
Differential Revision: D40092377
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86263
Approved by: https://github.com/aaronenyeshi
Our prevailing strategy for symbolic shapes in C++ is to only
write the SymInt version of the code, and pay a slight performance
tax from not knowing if it is symbolic or not. However, there are
some fastpath functions where this tax is unacceptable, and we want
to specialize for the int case. Sometimes, it is easy to template
the function; but when the function involves Tensors, it is not,
because the functions you may want to call are not templated,
e.g., t.view vs t.view_symint
This PR adds an at::symint:: namespace which contains templated
functions for all functions in PyTorch which you can use in this
way. To show this works, I refactored sum_to to stop incorrectly
reinterpret casting and instead use a template. Instead of
t.sizes(), we call at::symint::sizes<T>(t), and so forth.
The template functions are SFINAE'd using a template argument that
is not otherwise used. As such, deduction is impossible. Typically, deduction
is hard anyway, because many of the constructors are ambiguous (this
is why we split foo and foo_symint in the first place). So you must pass
a template argument to these functions.
These functions are codegened into Functions.h so they are subject
to per-operator headers. This matters most for methods, which likely
didn't include the per-operator header, so you will have to add an
include in that case. We never generate method variants for these.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86329
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
Summary:
This PR is an early prototype of a tool to quantize each layer of a model
N times, with N qconfigs each. We follow the design agreed upon in
https://fburl.com/gdoc/e1gaq3ih .
Current API:
```
m = M().eval()
example_input = (torch.randn(2, 2),)
qconfig_mappings = [
    QConfigMapping().set_global(torch.quantization.default_qconfig),
    QConfigMapping().set_global(torch.quantization.default_dynamic_qconfig),
]
backend_config = get_native_backend_config()
msp = prepare_n_shadows_model(
    m, example_input, qconfig_mappings, backend_config)
for _ in range(2):
    msp(*example_input)
msq = convert_n_shadows_model(msp)
msq(*example_input)
results = extract_results_n_shadows_model(msq)
print_comparisons_n_shadows_model(results)
# example output
subgraph_idx ref_node_name best_idx 1 2
-------------- --------------- ---------- ------- -------
subgraph_0 fc1 2 42.0834 42.6279
subgraph_1 fc2 2 43.7259 50.0593
```
Test plan:
```
python test/test_quantization.py -k test_n_shadows
```
Differential Revision: [D37650332](https://our.internmc.facebook.com/intern/diff/D37650332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80521
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
Summary:
The test is causing issues:
```
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%A: Tensor, %driver: str?):
%bias: None = prim::Constant()
%ret = aten::linalg_svdvals(%A, %driver)
~~~~ <--- HERE
%cloned = aten::clone(%ret, %bias)
return (%cloned)
RuntimeError: torch.linalg.svd: keyword argument `driver=` is only supported on CUDA inputs with cuSOLVER backend.
```
Just block the op and re-run the codegen script to remove everything and update the generated ops.
Test Plan: Existing tests
Differential Revision: D39973860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85983
Approved by: https://github.com/xuzhao9, https://github.com/tenpercent
Currently the index_select/index_add decompositions decompose to the `index` or `index_put` ops. The problem is that `index_select` and `index_add` accept int32 indices while `index` doesn't, which leads to an error in the meta function for those decompositions. This PR adds non-performant support for int32 indices to the `index` operations, to allow the decompositions to go through.
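A hedged sketch of the kind of call involved (my own example; the int32 path in `index` is non-performant and exists mainly so the decomposition type-checks):
```
import torch

x = torch.randn(4, 3)
idx = torch.tensor([0, 2], dtype=torch.int32)  # int32, not int64

# index_select accepts int32 indices...
out = torch.index_select(x, 0, idx)

# ...but its decomposition rewrites it in terms of aten::index, which
# previously accepted only int64 indices, breaking the meta function.
torch.testing.assert_close(out, x[idx.long()])
```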
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86309
Approved by: https://github.com/lezcano
https://github.com/pytorch/pytorch/pull/85780 updated all c10d headers in pytorch to use absolute paths, following the other distributed components. However, the headers were still copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch`, thus external extensions still have to reference the c10d headers as `<c10d/*.h>`, making the usage inconsistent (the only exception was c10d/exception.h, which was copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`).
This patch fixes the installation step to copy all c10d headers to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`, thus external extensions can consistently reference c10d headers with the absolute path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86257
Approved by: https://github.com/kumpera
Summary: the biggest issue was that the constructors for the fake_quantize
classes use custom partials that live in the observer module, so the
module for these needed to be set correctly in the constructor
classmethod.
Test Plan: python test/test_public_bindings.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86022
Approved by: https://github.com/jerryzh168
Our check on the SymInt representation can be implemented more efficiently as just a greater-than test, but the compiler doesn't seem to figure this out on its own. Help it out.
There is also some refactoring to simplify the code and add more debugging.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86230
Approved by: https://github.com/albanD
Summary: The "kill worker process" event was logged to Scuba only when the worker process was really reaped. We want to add a new event "timer expired", no matter the worker process will be reaped or not. This will help collect data before we enable the JustKnob to kill the worker process on timeout.
Test Plan:
### Unit Test
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/7318349508929624
RE: reSessionID-ea464c43-54e7-44f2-942b-14ea8aa98c74 Up: 10.5 KiB Down: 1.1 MiB
Jobs completed: 100. Time elapsed: 3206.9s. Cache hits: 91%. Commands: 11 (cached: 10, remote: 1, local: 0)
Tests finished: Pass 55. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
--------
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/6473924579130483
RE: reSessionID-231a47b7-a43d-4c0f-9f73-64713ffcbbd3 Up: 5.7 MiB Down: 1.9 GiB
Jobs completed: 182156. Time elapsed: 282.4s. Cache hits: 99%. Commands: 72112 (cached: 72107, remote: 1, local: 4)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Differential Revision: D39903376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85861
Approved by: https://github.com/d4l3k
Also, make sure it raises catchable errors if invoked with integral types.
Previously, it used to fail with the following fatal error when invoked for `torch.half`, and similarly when invoked for integral types:
```
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/4883e71d-37bd-11ed-b0ef-b25c5e9b9057/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<2xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
Modified `test_gelu_simple` to check both fwd and backward gradients for gelu
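A hedged illustration of what "catchable" means here (the op and dtype are chosen for illustration; requires an MPS-enabled build):
```
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    x = torch.arange(4, device="mps")  # integral dtype
    try:
        F.gelu(x)
    except (RuntimeError, TypeError) as e:
        # Previously this kind of call could abort the process with an
        # MPSGraph/LLVM fatal error; now it surfaces as a Python exception.
        print("caught:", e)
```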
It's not clear to me what the difference is between `unfold` and `unfold_copy`, as the latter is codegen'd.
I also took this chance to clean up the implementation of unfold and its reference.
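For reference, a quick reminder of what `Tensor.unfold` computes (example values are my own):
```
import torch

x = torch.arange(1., 8.)
# Sliding windows of size 2 with step 1 along dimension 0.
windows = x.unfold(0, 2, 1)
# tensor([[1., 2.], [2., 3.], [3., 4.], [4., 5.], [5., 6.], [6., 7.]])
```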
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85629
Approved by: https://github.com/mruberry