Fixes #122016 and #123178. This regression stems from an OS-side change that requires a slight adjustment on the PyTorch side to restore the previous behavior. Additionally, we cleared out pre-MacOS 13 workarounds.
Before the fix on MacOS 14.4:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```
After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```
This also fixes complex number initialization and, as such, makes `nn.functional.rms_norm` pass on MacOS-14+.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
(cherry picked from commit 05289a278c3eaca271061649982f38c435b50674)
Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it
- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
```
/AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
```
- Change the order of the max and min calls, as it's apparently important for consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is not always less than or equal to `b`.
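To make the ordering point concrete, here is a minimal sketch with made-up values (not taken from the kernel code) showing that the two orderings disagree whenever `c < b`:
```python
# Hypothetical values chosen so that c < b, which is where the two orderings diverge.
a, b, c = 5.0, 10.0, 2.0
assert min(max(a, b), c) == 2.0   # max first, then min
assert max(min(a, c), b) == 10.0  # same operands, opposite order
```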
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Fixes breaking changes for ONNX Runtime Training.
PR https://github.com/pytorch/pytorch/pull/121102 introduced an incompatibility with ORT training because of a change in parameter type. This PR adds back the previous parameter types, verified to work with ORT training.
Error with current scenario:
```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);
site-packages/torch/include/ATen/DLConvertor.h:15:46: note: initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
(cherry picked from commit 765c3fc138fda4b49978403ee1394040221957cc)
Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
This fixes a bug when casting a module that has DTensor parameters. The old behavior swapped the .data field of the Tensor subclass, which is incorrect when dealing with tensor subclasses that may have multiple child tensors.
This change uses the `swap_tensors` method to swap the whole tensors, not just the .data field.
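As a minimal sketch of the underlying primitive (plain tensors here rather than DTensor, and assuming `torch.utils.swap_tensors` behaves as described above):
```python
import torch

# swap_tensors exchanges the two tensor objects in place (not just their .data),
# so all subclass state travels with them.
a = torch.zeros(2)
b = torch.ones(2, dtype=torch.float64)
torch.utils.swap_tensors(a, b)
print(a.dtype, b.dtype)  # torch.float64 torch.float32
```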
Test plan:
```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki
(cherry picked from commit e6ee8322d767ab241ce1651e7c178f539e8e3199)
Co-authored-by: Tristan Rice <rice@fn.lc>
Downloading CUDA sometimes fails and breaks the build process, but
AOTriton does not need these packages. This commit comments out the
related downloading scripts.
This PR fixes the two major issues that were discovered after the initial merge of PR #121561:
1. The Flash Attention support added there has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical-stability advantages. This PR fixes this problem.
2. There is a flaw in the memory storage handling in PR #121561 which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary due to the more flexible backward kernel shipped by PR #121561.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
Fixes #118849
Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing results, avoiding repeatedly recreating the submesh and the underlying sub-PG, which could lead to unexpected behaviors.
We will follow up with reusing pg from the parent_mesh during submesh creation.
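A minimal sketch of the behavior this enables (requires an initialized process group, e.g. under torchrun; the identity assertion reflects the caching described above):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh

mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
tp_mesh_a = mesh_2d["tp"]
tp_mesh_b = mesh_2d["tp"]
# Slicing the same submesh twice should now reuse the cached result instead of
# rebuilding the submesh and its sub process group.
assert tp_mesh_a is tp_mesh_b
```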
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
* Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This op is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
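A small sketch of those view semantics, using the public jagged-layout API (assumes a build where `layout=torch.jagged` is available):
```python
import torch

# The jagged NT and its values() share the same underlying buffer, so an
# in-place mutation of the buffer is visible through the NestedTensor view.
nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)
vals = nt.values()      # dense (6, 3) buffer backing the jagged NT
vals.zero_()
print(nt.unbind()[0])   # all zeros: the NT observed the mutation
```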
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
(cherry picked from commit cd6bfc7965fc5ae20720bae0994e332e56f819c0)
* Update executorch.txt
* Update executorch.txt
* Fix linter error
---------
Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
Co-authored-by: Guang Yang <42389959+guangy10@users.noreply.github.com>
Summary:
Original commit changeset: e52b8809c8d8
Original Phabricator Diff: D54778906
We have to backout this diff.
D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248
Test Plan: Sandcastle
Reviewed By: satgera
Differential Revision: D54825114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
(cherry picked from commit e99fa0042cd3dcd2eded24585d59c53f2da9d9f5)
Summary: Ideally we should do what's in the TODO. Just doing this for now to unblock llama capture.
Test Plan: capturing llama and using pt2e to quantize it
Differential Revision: D55354487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122683
Approved by: https://github.com/kimishpatel
(cherry picked from commit 41d24df08f72e059c4eebdde4315e63a9918406f)
Co-authored-by: Jacob Szwejbka <jakeszwe@meta.com>
Differential Revision: D54964130
When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
(cherry picked from commit 0d845f7b0781f091452a5fd31de14e1c2117f3d4)
Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com>
Summary: Adds a pass that blindly removes the functionalize HOP without considering whether it is safe. Useful for ExecuTorch today and other use cases that have additional logic that can reason about when this pass is safe to use.
Test Plan: added unit test
Differential Revision: D55103867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
(cherry picked from commit c84f81b395fff969bbd2f784efad8ab1a8aa52de)
Co-authored-by: Jacob Szwejbka <jakeszwe@meta.com>
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton):
- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
  * MI300X is now supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
  * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
  * The varlen API will be supported in the next release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
  * Now it supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
  * Kernels are selected according to autotune information from Triton.
Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API
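As a quick, hedged illustration of the relaxed constraints listed above (non-power-of-two sequence length, head dimension below 256), SDPA can be asked for the flash backend on a supported ROCm GPU:
```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: seq_len=127 is not a power of two, head_dim=96 <= 256.
q, k, v = (torch.randn(2, 8, 127, 96, device="cuda", dtype=torch.float16)
           for _ in range(3))
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 127, 96])
```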
This is a more extensive fix to #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.
Test Plan:
Run test cases
Run the oemae lowering benchmark with and without this fix. FLOP/s 29 -> 34.
Reviewed By: frank-wei
Differential Revision: D54711064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch, making the code cleaner and faster.
@albanD This is the GIL change extracted from #112607 as discussed.
Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is no longer safe to call
CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.
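A hedged sketch of the intended usage after the refactor (assuming the new module mirrors `torch._dynamo.test_case` and exposes `TestCase` and `run_tests`):
```python
import torch
from torch._inductor.test_case import TestCase, run_tests

class MyInductorTest(TestCase):
    def test_compile_add(self):
        fn = torch.compile(lambda x: x + 1)
        self.assertTrue(torch.equal(fn(torch.ones(3)), torch.full((3,), 2.0)))

if __name__ == "__main__":
    run_tests()
```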
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weight updates.
Test Plan: run added test cases
Reviewed By: frank-wei
Differential Revision: D53843428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
Summary: Taking the rightmost part of the FQN causes name conflicts when there are multiple instances of the same class. Changed to replace "." in the FQN with "_" to avoid invalid syntax in input args.
Test Plan: added test case
Differential Revision: D54435230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```
sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that pytorch can test against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases
I would like to add that the version check is not necessary as in any case the outcome is the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
- Fix round-robin sharding when there are no test times and sort_by_time=False
- Add more tests to test_test_selections for sort_by_time=False
- Add more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
- Refactor duplicated code
Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case and from 104 to 18 usec if transposed.
Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`, which yields slightly better numbers), summarized in the following table:
| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) | 104.80 usec | 28.68 usec | 24.82 usec |
Test plan: CI on MacOS; both CPU and MPS tests check fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used).
To investigate:
- why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
Make a test that fails on purpose to trigger retries. Check the opposite of success (that env vars exist)
It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically detected and used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.
This API extends that mechanism to HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion finishes.
Without this PR, the user would have to convert the safetensors files to PyTorch format manually and feed them to `torch.onnx.ONNXProgram.save`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
Fixes #118566
Unlike **OpOverload** or **OpOverloadPacket**, there is a lot of complex information in the schema, so keeping it as-is is probably a good choice, but in theory the **\_\_repr__** function should show the class name as well as some other key information.
If you have any suggestions, please share them, thank you.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
**Summary**
We should skip the `visualize_sharding()` function on ranks that are not part of the DTensor's mesh. Otherwise, an exception will be thrown by the current visualization logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.
This is also required for a later change, where we must run shape prop on each of these passes in order for the subsequent passes to have the correct shape information.
Reviewed By: frank-wei
Differential Revision: D53579545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.
This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
The async output option was only available in the `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that users can control it.
This PR adds an async_op option to the redistribute call, allowing the user to control
whether to perform the tensor redistribution asynchronously or not.
By default we set this to False, to follow the semantics of the c10d
collectives.
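A minimal sketch of the new flag (run under torchrun with a process group initialized; shapes and placements are illustrative):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cuda", (4,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# async_op=True lets the caller overlap the redistribution collective,
# mirroring the c10d async_op semantics; the default remains False.
replicated = dt.redistribute(mesh, [Replicate()], async_op=True)
```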
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
To minimize the Python overhead, I believe the same could be done for the rest of the ops. What do you think, @kwen2501?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.
Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
break
```
As of right now, `op.shape == final_shape` checks whether one of the binary op's operand shapes is the same as the binary op's output shape.
* If one of them is a dynamic shape, then we'll create a guard via `SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.
This is a problem when the number of dimensions isn't the same between `op.shape` and `final_shape`. Take the case above for example: `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, it doesn't necessarily mean that `s0 != 10`.
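A minimal sketch of the kind of check the fast path needs (illustrative helper, not the actual FakeTensor code):
```python
def shapes_definitely_equal(lhs_shape, rhs_shape):
    # Only compare element-wise (which may introduce symbolic guards) when the
    # number of dimensions already matches; (10, 192) vs (s0, 10, 192) can be
    # rejected without guarding on s0 at all.
    if len(lhs_shape) != len(rhs_shape):
        return False
    return lhs_shape == rhs_shape
```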
Some thoughts (feel free to ignore): what if the number of dimensions is equal but one of the shapes has symbols? Here are three cases:
1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.
# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]: if tensor.storage().size() != tensor.numel():
```
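One way to express the same check without the deprecated TypedStorage API is sketched below (this is an illustration, not necessarily the exact change that landed); `untyped_storage().size()` reports bytes, so it is divided by the element size:
```python
import torch

tensor = torch.randn(4, 4)
storage_numel = tensor.untyped_storage().size() // tensor.element_size()
if storage_numel != tensor.numel():
    ...  # handle offset / non-contiguous storage as before
```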
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
Summary:
The `add_metadata_json` function in the profiler only works when called during trace collection. However, sometimes we want to pass in some user-defined metadata and add it to the trace before trace collection starts, e.g. when the profiler is defined.
This PR adds a function `preset_metadata_json` for this purpose. The preset metadata will be stored and added to the trace later.
Differential Revision: D54678790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121487
Approved by: https://github.com/aaronenyeshi
to_local accepts a `grad_placements` argument if the user chooses to pass it; previously
we enforced the grad_out to have the "same" placement as the current
DTensor for safety.
But I realized that we DO NOT need to enforce this constraint. Why?
The backward placement does not need to be the same as the forward tensor placement; this
is already the case for param vs. param.grad (i.e. param can be replicate
and grad can be partial), so we should not restrict this for activation
vs. activation grad either.
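A hedged sketch of the now-allowed usage (run under torchrun; placements are illustrative, the point being that `grad_placements` no longer has to match the forward placements):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

mesh = init_device_mesh("cuda", (4,))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
# Tell autograd that the incoming gradient for this local tensor will be
# replicated, even though the forward activation is sharded on dim 0.
local = dt.to_local(grad_placements=[Replicate()])
```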
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
Summary:
Currently we do not have an easy mechanism to distinguish between models created with some specific config.
We use a warning instead of failing directly.
Test Plan: Messaging change only.
Reviewed By: aakhundov
Differential Revision: D54622522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121396
Approved by: https://github.com/chenyang78
# Motivation
In the backward of per-parameter-sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when tensor.size(0) % world_size != 0.
### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):
Input tensors:
```
AAAA BBB CC
AAAA BBB
BBB
```
Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```
### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):
Input tensors:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Reduce-scatter-copy-in first pads:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```
The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.
# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:
```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```
This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composed of existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
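A small sketch on the Example 1 shapes above (assuming the op is exposed as `torch._chunk_cat`; the tensor values are placeholders for A/B/C):
```python
import torch

A = torch.full((2, 4), 1.0)   # "A"
B = torch.full((3, 3), 2.0)   # "B"
C = torch.full((1, 2), 3.0)   # "C"
# Chunk each tensor into 3 pieces along dim 0 (zero-padding as needed) and
# concatenate the flattened chunks row-wise, matching the diagram above.
out = torch._chunk_cat([A, B, C], dim=0, num_chunks=3)
print(out.shape)  # torch.Size([3, 9])
```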
## Requirements on input
1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension.
3. Expect positive num_chunks
4. Expect non-empty input tensor list and each input tensor should have at least 1 element
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable and possibly an additional compiled_args + apply_with_saved implementation. this was the only way I could think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. this case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
I was originally trying to solve https://github.com/pytorch/pytorch/issues/120799 but got sidetracked along the way.
This PR contains a couple fixes. Let me know if you want me to split them up!
- Properly handle invalid user code when "super()" is called from non-method/classmethod. It will now properly raise the same error as CPython
- Fix base VariableTracker `__str__` method shadowing all `__repr__` methods defined in subclasses
- Fix accessing a classmethod on a user object to bind "cls" and not "self"
- Fix custom class handling of super() call to properly handle mixed regular/class/static methods
Locally, test_repros.py -k test_batch_norm_act still fails, where the generated graph module is:
```
Call using an FX-traced Module, line 8 of the traced Module's generated forward function:
x = self.forward(l_x_); self = l_x_ = None
x_1 = self.L__self___act(x); x = None
```
note that "self" is being unset on the first line even though it is used on the second one.
For reference, this is the test c268ce4a6d/test/dynamo/test_repros.py (L1368-L1369)
I cannot figure out where the generated forward function comes from though, any hint would be welcome!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121365
Approved by: https://github.com/jansel
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).
However, there's a problem - there are python use cases that initializes/registers process groups in one thread, and runs collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.
This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.
Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
Summary:
The profiling, even when disabled, takes up about 1.5% cpu for a model I'm looking into.
This patch just splits into with/without profile runs.
The potential downside is that now the script can't enable profiling on itself. It doesn't seem to be used anywhere. If that's a crucial use case, we can do something about it, but ideally we wouldn't.
Test Plan:
Link with profiles:
https://fburl.com/scuba/strobelight_services/ihxsl7pj
```
buck2 run fbcode//caffe2/test/cpp/jit:jit
```
Reviewed By: zhxchen17
Differential Revision: D54066589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121404
Approved by: https://github.com/zhxchen17
By deleting `where_mps` and registering an MPS dispatch for `where_kernel`.
As a result of this change, the resizing and type-checking logic is shared between the MPS, CPU and CUDA backends.
Add a test case to `TestMPS.test_where` (that should eventually be removed when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
- There are no usages of this internally.
- There are very few usages of this in OSS (most of these are forks of old
repositories).
- This flag doesn't do anything.
We're deprecating it to prevent confusion. I will delete it immediately
after the branch cut.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121413
Approved by: https://github.com/albanD, https://github.com/soulitzer
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we deprecated this arg for a while (by throwing deprecating
messages), we should remove it before the release
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
**Summary**
Add the operator `quantized_decomposed.fake_quant_per_channel` and test the forward and backward of this op by comparing to ATen.
**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```
**Next Step**
Optimize the performance: from the generated code of the forward and backward graphs, the code isn't vectorized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Summary: When fake tensors are passed to a graph module and we do runtime assertions on them, we can accidentally trigger specialization guards. It's better to just relax the checking for these.
Test Plan: confirmed that problem in T181400371 is now fixed
Differential Revision: D54658960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121460
Approved by: https://github.com/angelayi
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist
Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
These are just tests that I noticed passed on current main
Running:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/dynamo/test_dynamic_shapes.py test/dynamo/test_compile.py -k 'test_export_decomp_dynamic_shapes or test_export_dynamic_dim_cleanup_dynamic_shapes or test_export_multi_dynamic_dim_constraint_dynamic_shapes or test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes or test_export_no_raise_dynamic_shapes or test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes or test_export_raise_on_relationship_dynamic_shapes or test_exported_graph_serialization_dynamic_shapes or test_retracibility_dict_container_inp_out_dynamic_shapes or test_retracibility_nested_list_out_dynamic_shapes or test_exception_table_e2e_2_dynamic_shapes or test_exception_table_e2e_dynamic_shapes or test_exception_table_parsing_dynamic_shapes or test_inference_mode_dynamic_shapes or test_inplace_view_on_graph_input_dynamic_shapes or test_numpy_torch_operators_dynamic_shapes or test_py311_jump_offset_dynamic_shapes or test_lazy_module_no_cls_to_become_dynamic_shapes or test_batchnorm_e2e_dynamic_shapes or test_functools_wraps_dynamic_shapes or test_jit_trace_errors_dynamic_shapes or test_multi_import_dynamic_shapes or test_requires_grad_guards_with_grad_mode2_dynamic_shapese or test_dynamo_signatures'
```
BEFORE: `1 failed, 1 passed, 22 skipped, 1372 deselected`
AFTER: `24 passed, 1372 deselected`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121378
Approved by: https://github.com/oulgen
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
- added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
- pass in `local_rank_id` from subprocess start
Test Plan: No functional changes.
Differential Revision: D54038627
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
Things that were bad before this PR:
1. Temporarily unsetting functional tensor mode and proxy mode both had duplicate implementations
2. There are variants of mode-handling private utils that have duplicate implementations (different APIs calling repeated implementations, so I refactored)
3. The _push_mode API used to take a dispatch key argument which is not necessary.
4. There are unused APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.
This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.
There are some next steps though:
- ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though. ¯\_(ツ)_/¯
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).
Notable additions:
- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).
- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.
- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).
- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
Codegen examples from the new unit tests:
- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.
(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
* subclass -> dense
* dense -> subclass
* subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available
Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
Move tests that are mentioned in the PR body or commit message to the front. Also attempts to find any issues/PRs mentioned in the PR body and search those too (e.g. if you link a disable issue and that issue contains the test file that it was failing on)
looking for: dynamo/test_export_mutations
Also removes some printed information in TD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
Some implementations, like OpenDAL, do not work with AWS IMDSv2; this script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system.
When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`ed, and then sccache can use those credentials.
Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.
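A hedged sketch of the new flag added here (assuming `capturable=True` is now accepted by `torch.optim.Adamax`, as described above):
```python
import torch

params = [torch.randn(4, 4, device="cuda", requires_grad=True)]
# The capturable implementation keeps optimizer state on the params' device so
# the step can be captured in a CUDA graph.
opt = torch.optim.Adamax(params, lr=1e-2, foreach=True, capturable=True)
params[0].sum().backward()
opt.step()
```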
Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code with a try-catch.
Test Plan: CI
Differential Revision: D54604339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the DTensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even-sharding case.
**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
This PR proposes to keep the original key order of the state_dict, as the issue creator proposed. It also fixes a bug concerning how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.
In the original code, ``_metadata`` was handled as a ``key``:
```
# also strip the prefix in metadata if any.
if "_metadata" in state_dict:
```
This is not the case; ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:
```
# also strip the prefix in metadata if any.
if hasattr(state_dict, "_metadata"):
```
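A minimal sketch of the corrected handling (an illustrative helper, assuming an OrderedDict-style state_dict as returned by `nn.Module.state_dict()`):
```python
from collections import OrderedDict

def strip_prefix(state_dict: OrderedDict, prefix: str) -> OrderedDict:
    out = OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )
    # _metadata lives on the OrderedDict as an attribute, not as a key inside it,
    # so it is checked with hasattr and copied over separately.
    if hasattr(state_dict, "_metadata"):
        out._metadata = OrderedDict(
            (k[len(prefix):] if k.startswith(prefix) else k, v)
            for k, v in state_dict._metadata.items()
        )
    return out
```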
This PR also includes the necessary test.
Fixes #106942
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).
The change to torch/testing/_internal/common_utils.py came up during testing: in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which was causing an error.
Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.
I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe making it the right name is the first priority, instead of
worrying about the issue of reusing the old name.
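A hedged sketch of the intended usage (run under torchrun; module and layer names are illustrative):
```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

tp_mesh = init_device_mesh("cuda", (4,))
block = nn.TransformerEncoderLayer(d_model=256, nhead=8)
# Shard the norm layer on the sequence dimension via the new style instead of
# hand-written distribute_module calls.
parallelize_module(block, tp_mesh, {"norm1": SequenceParallel()})
```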
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
# Update Profiler API to collect Execution Traces
## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch
def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver()  # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```
See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.
## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)
Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]
## Why correlate Execution Trace with PyTorch/Kineto Trace
Both Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps will be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.
## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section
# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.
```
pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP
Running 1 items in this shard
test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary: This specific rocm logic will make aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, this will prevent us build aten-hip by default, which requires us to set up rocm specific rules which is an extra burden for our build system.
Test Plan: sandcastle + oss ci
Differential Revision: D54453492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).
```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
Fixes #121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```
To fix, added condition checking to raise a runtime error with a debug message instead, specifying the correct dimensions necessary.
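A hedged illustration of the new behavior (the exact error type and message are illustrative, not quoted from the implementation):
```python
import torch

x = torch.randn(3)  # replication_pad1d expects a 2-D or 3-D input
try:
    torch._C._nn.replication_pad1d(x, (1, 1))
except Exception as e:
    # Previously this class of invalid input could segfault; now it should
    # surface a Python-level error with a descriptive message.
    print("caught:", type(e).__name__, e)
```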
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.
We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor
We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.
```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
    model = ...
parallelize_module(model, tp_mesh, ...)
fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
    assert param.device.type == "meta"
model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()
```
This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
Without this the build will freeze with prompt:
Proceed ([y]/n)?
I'm using rootless podman in vscode instead of docker but I think it should not affect this.
..or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.
Btw, I also had to uncomment the line: "remoteUser": "root" in devcontainer.json to finish the post installation properly but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
Fixes#115607
We were missing guards when the grads were set to `None`. So if we compiled the optimizer with grads set to their proper value, and then with the grads set to `None` we'd continuously run the `None` version because all of the guards would pass and it would be ordered before the correct version in the cache.
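A hypothetical repro sketch of the scenario described above (the exact sequence is an assumption for illustration):
```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()                           # compiled with real grads
opt.zero_grad(set_to_none=True)
step()                           # compiles a version specialized for grads == None
model(torch.randn(2, 4)).sum().backward()
step()                           # without the added guards, this could keep hitting the None version
```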
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
Limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable and possibly additional compiled_args + apply_with_saved implementations; this was the only way I could think of for soundness
- will throw if we can't hash the saved_data, i.e. for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called; this case seems extremely unlikely, and the only alternative to hash collision I can think of is compiling with reflection
- tensors not saved via save_variables are not lifted and are specialized on the TensorImpl*'s hash (treated as a memory address); if needed, we can lift them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Fix: https://github.com/pytorch/xla/issues/6009
This PR adds another special case to `TensorVariable.method_new`, where it re-dispatches `new` to `new_empty`.
Since we are using fake tensors, the `new` call doesn't actually get to the corresponding backend (e.g. XLA). So, things like the following might happen:
```python
import torch
import torch_xla.core.xla_model as xm

@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())
    # new_x.device() == "xla"
    # x.device() == "xla:0"
    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```
Resulting in the following error:
```python
Traceback (most recent call last):
...
File "torch/_dynamo/utils.py", line 1654, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1655, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1776, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "torch/_dynamo/utils.py", line 1758, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
return self.wrap_meta_outputs_with_default_device_logic(
File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
return tree_map(wrap, r)
File "torch/utils/_pytree.py", line 900, in tree_map
return treespec.unflatten(map(func, *flat_args))
File "torch/utils/_pytree.py", line 736, in unflatten
leaves = list(leaves)
File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
) = FakeTensor._find_common_device(func, flat_args)
File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
merge_devices(arg)
File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```
Using `new_empty`, instead, fixes this error because it uses the device from the source
tensor, instead of inferring from the current dispatch key set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.
Test Plan: added test
Differential Revision: D54516595
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
in layer_norm_kernel.cu since the qualifier seems to be ignored according to:
```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function
Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but it turned out to be very expensive in some cases.
This PR adds support for primitive (or container-of-primitive) arguments in RECORD_FUNCTION (a sketch of the kind of op this affects follows).
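A hedged sketch of the kind of op this affects (the function name and arguments are made up): a Python autograd.Function whose forward takes a primitive, non-tensor argument. Whether the scalar actually shows up in the trace depends on the profiler configuration.
```python
import torch

class ScaleBySeqLen(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, seq_len: int):
        ctx.seq_len = seq_len
        return x * seq_len

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.seq_len, None

with torch.profiler.profile(record_shapes=True) as prof:
    y = ScaleBySeqLen.apply(torch.randn(2, 8, requires_grad=True), 4)
    y.sum().backward()
print(prof.key_averages().table(sort_by="cpu_time_total"))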
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Summary: Update the macros to exclude using `__grid_constant__` when compiling for devices > sm80 but with CUDA version < 11.8.
Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding
Differential Revision: D54556796
Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts
This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on
Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
manually restore fully_shard after `__exit__` from the mock.patch ctx. This will fix flaky CIs in trunk
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```
this is a workaround to make mock.patch(fully_shard) work with multi-threading:
* thread 1 sets `func.__module__[fully_shard]` = patched function
* thread 2 reads `func.__module__[fully_shard]`, thinks it is the original, and fails to restore fully_shard during `__exit__`
* this PR manually restores fully_shard after `__exit__`
Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
Summary: The qnnpack library merge fails for some applications. This fix implements the recommendation from the Android build team to prevent merging for qnnpack.
Test Plan:
1. Measure the binary size impact
1. Release build failed previously; now it should succeed
Differential Revision: D54048156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for the `assign` argument to `nn.Module.load_state_dict`.
Similar to when `torch.__future__.get_swap_module_params_on_conversion()` is `False`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` are preserved (see the sketch below).
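A minimal sketch of the behavior described above (illustrative, not taken from the PR's tests):
```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
sd = {k: v.detach().clone().to(torch.float64) for k, v in m.state_dict().items()}
m.load_state_dict(sd, assign=True)
# With assign=True there is no self.copy_(other); the float64 properties of
# the incoming tensors are expected to be preserved.
print(m.weight.dtype)
```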
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
Always preserve the requires_grad of the param in the module. Documentation is fixed in the PR stacked above.
Also fix the test case to test loading a state_dict generated with `keep_vars=False` (the default).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
Fixes#83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator` whereas the permuted case may become a >2d operation (for example, two reduced dimensions and non-reduced dim).
Since the cpu reduction loop of `TensorIterator` only operates on two dimensions at a time, this means the intermediate sums will be truncated to lower precision.
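An illustrative sketch of the kind of case discussed above (shapes and dtype are assumptions): compare a reduction over a permuted fp16 tensor against an fp64 reference to observe the accumulated error.
```python
import torch

x = torch.rand(64, 64, 64, dtype=torch.half)
ref = x.double().permute(2, 0, 1).sum(dim=(1, 2))   # fp64 reference
out = x.permute(2, 0, 1).sum(dim=(1, 2))            # permuted fp16 reduction
print((out.double() - ref).abs().max())
```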
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
# Summary
Updates the SDPA docs to fix some small inaccuracies and point to the new sdpa_kernel context manager. The Enum-like type bound from the C++ SDPBackend does not render its fields for some reason, so manually list them for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
The main thread and the autograd threads are latency-critical: they launch CPU/GPU/accelerator kernels, and if they get preempted for some reason, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency-sensitive threads.
I used Kineto traces to verify if the thread names were propagated:
<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">
Also:
```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3065920 C ...me#python#py_version_3_10 1968MiB |
| 1 N/A N/A 3065926 C ...me#python#py_version_3_10 1978MiB |
| 2 N/A N/A 3065930 C ...me#python#py_version_3_10 2084MiB |
| 3 N/A N/A 3065936 C ...me#python#py_version_3_10 2016MiB |
| 4 N/A N/A 3065939 C ...me#python#py_version_3_10 1998MiB |
| 5 N/A N/A 3065943 C ...me#python#py_version_3_10 2070MiB |
| 6 N/A N/A 3065948 C ...me#python#py_version_3_10 2026MiB |
| 7 N/A N/A 3065952 C ...me#python#py_version_3_10 2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
PID SPID TTY TIME CMD
3065920 3065920 pts/14 00:01:04 pt_main_thread
...
3065920 3092181 pts/14 00:00:40 pt_autograd_d0
3065920 3092182 pts/14 00:00:00 pt_autograd_d1
3065920 3092183 pts/14 00:00:00 pt_autograd_d2
3065920 3092184 pts/14 00:00:00 pt_autograd_d3
3065920 3092185 pts/14 00:00:00 pt_autograd_d4
3065920 3092186 pts/14 00:00:00 pt_autograd_d5
3065920 3092187 pts/14 00:00:00 pt_autograd_d6
3065920 3092188 pts/14 00:00:00 pt_autograd_d7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
Closes #120988
Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```
This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
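A small illustrative sketch of the non-fallback behavior quoted above:
```python
import torch

out = torch.zeros(3, requires_grad=True)
try:
    torch.add(torch.ones(3), torch.ones(3), out=out)
except RuntimeError as e:
    # "add(): functions with out=... arguments don't support automatic differentiation, ..."
    print(e)
```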
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
Summary: WrapperModule seems like a good idea but may introduce some surprising behavior to users. For example, it never registers enclosed modules as submodules, and therefore it's unclear what the state dict for the exported program should look like: some people may argue for including every state in the state dict, while others want to keep them as constants.
Test Plan: CI
Reviewed By: tugsbayasgalan
Differential Revision: D54326331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.
This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.
**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5
cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)
# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)
# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)
# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```
Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
The spectral norm implementation has extensive tests, but there doesn't appear to be any check that the spectral norm (= top singular value) is indeed correctly calculated. There should be at least one such test case.
This adds one such test case for the parametrizations.py implementation of spectral norm (a sketch of the kind of check follows).
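A minimal sketch of such a check, assuming the `torch.nn.utils.parametrizations.spectral_norm` API and enough power-iteration steps to converge:
```python
import torch
import torch.nn as nn

lin = nn.utils.parametrizations.spectral_norm(nn.Linear(5, 7))
lin.train()
for _ in range(50):
    _ = lin.weight  # each access in training mode runs a power-iteration step

# The top singular value of the normalized weight should be close to 1.
sigma_max = torch.linalg.matrix_norm(lin.weight.detach(), ord=2)
print(sigma_max)
```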
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121068
Approved by: https://github.com/soulitzer
Fixes#120044
Should fix build from source instructions on release branch here: https://github.com/pytorch/pytorch#from-source
Please note we are using the /test/ channel for the release here to make sure it works before the actual release is completed.
Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 2.2.0
Uninstalling pytorch-triton-2.2.0:
Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```
Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
Attempting uninstall: pytorch-triton
Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing, so we propose to skip dynamic batch size testing of this model in this PR.
* Error msg is
```
File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
* Root cause:
The benchmark code only annotates an input dim as dynamic when its size equals the batch size (c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)). If it fails to find any dim equal to the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:
```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124., 82., ..., 3., 4., 5.],
[125., 104., 65., ..., 3., 3., 4.],
[ 87., 68., 34., ..., 2., 2., 2.],
...,
[ 47., 45., 41., ..., 45., 45., 45.],
[ 46., 44., 40., ..., 44., 45., 46.],
[ 46., 44., 40., ..., 43., 45., 46.]],
[[154., 129., 84., ..., 3., 4., 5.],
[133., 110., 69., ..., 3., 3., 4.],
[ 95., 76., 43., ..., 2., 2., 2.],
...,
[ 44., 42., 38., ..., 34., 37., 39.],
[ 43., 41., 37., ..., 35., 39., 41.],
[ 43., 41., 37., ..., 35., 40., 43.]],
[[171., 140., 85., ..., 3., 4., 5.],
[147., 120., 71., ..., 3., 3., 4.],
[103., 83., 47., ..., 2., 2., 2.],
...,
[ 46., 44., 40., ..., 16., 20., 22.],
[ 45., 43., 39., ..., 17., 22., 26.],
[ 45., 43., 39., ..., 18., 24., 28.]]])}, ... ],)
```
None of the input dims equals the input batch size, so I think we need to skip dynamic batch size testing for this model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
Fixes#115331.
This is a temporary fix that increases the compile-time maximum number of GPUs to 120 until #119639 can be merged. Changing the parameter to 128 leads to annoying errors, as some checks would become tautological (`int8_t` is always < 128).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121076
Approved by: https://github.com/albanD
Summary:
While testing I noticed that if we generate different configs, we fail to use the remote cache, so let's include the configs in the cache key.
Not sure how to write a deterministic test for this.
Test Plan: existing tests
Differential Revision: D54500957
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121152
Approved by: https://github.com/aakhundov
Summary: While investigating why we were calling put each time, I noticed that the memcache backend returns a list instead of a direct result, which means that we were correctly fetching the cached result but not using it.
Test Plan: The test should now work as expected
Differential Revision: D54500851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121151
Approved by: https://github.com/aakhundov
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as
* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values
https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.
This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.
Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
Differential Revision: D54447700
## Context
This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.
Before:
* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.
After:
* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.
Benefits:
* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
benchmark fusion currently does not support foreach kernels. If we don't explicitly skip foreach kernels, we end up with exceptions in `codegen_node_schedule` because individual nodes in a foreach kernel may have incompatible shapes from a pointwise/reduction perspective.
cc Manman Ren ( @manman-ren ) who reported the issue when turning on benchmark fusion on BertForMaskedLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121168
Approved by: https://github.com/Chillee
Summary:
add_histogram fails for this data type. Updating the conversion code to handle it.
Stack trace for the failure:
```
[trainer0]Traceback (most recent call last):
[trainer0] File "<torch_package_0>.tensorboard/logging/summary_v2.py", line 203, in unscriptable_record_summary
[trainer0] unscriptable_histogram(name, t, step, ranks)
[trainer0] File "<torch_package_0>.tensorboard/logging/fx_v1.py", line 146, in unscriptable_histogram
[trainer0] Adhoc.writer().add_histogram(tag, x, step.int())
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer.py", line 40, in wrapper
[trainer0] resp = super_method(*args, **kwargs)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer_oss.py", line 526, in add_histogram
[trainer0] histogram(tag, values, bins, max_bins=max_bins), global_step, walltime
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/summary.py", line 482, in histogram
[trainer0] values = make_np(values)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 23, in make_np
[trainer0] return _prepare_pytorch(x)
[trainer0] File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 30, in _prepare_pytorch
[trainer0] x = x.detach().cpu().numpy()
[trainer0]TypeError: Got unsupported ScalarType BFloat16
```
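A hypothetical sketch of the conversion fix (not the exact patch): numpy has no native bfloat16, so the tensor is cast to float before calling `.numpy()`.
```python
import torch

x = torch.randn(4, dtype=torch.bfloat16)
x_np = x.detach().float().cpu().numpy()  # cast to float32 first, then convert
print(x_np.dtype)  # float32
```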
Test Plan: Updated unit test that was failing before but passes after this change.
Reviewed By: hamzajzmati, jcarreiro
Differential Revision: D53841197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120087
Approved by: https://github.com/jcarreiro, https://github.com/yanboliang
Differential Revision: D54487619
## Context
Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built instead of needing to "guess" an adequate amount.
## Implementation Details
* Check `config.descriptorPoolMaxSets > 0` to determine if the descriptor pool should be initialized
* Introduce a `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
Summary: Inductor currently has a best config cache for kernels that it generates. This is a local cache done via writing to the file system. This diff takes this local cache to remote by reusing the existing triton caching mechanism built via Memcache internally and Redis externally.
Test Plan:
tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`
Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk
The plan is to land this diff with this turned off and gradually introduce this.
Differential Revision: D54398076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
This is a BC-breaking change to distribute_module. The underlying rationale for this change is that sometimes, in the input_fn/output_fn, users want access to the current module for some attributes. This might not be common, but in some cases it's worth having access to the module.
An outstanding use case we want to support is float8: if we want to make float8 work with the TP API, the input_fn/output_fn of the TP parallel styles would need access to the module, where the module might encapsulate a `dynamic_linear.emulate` attribute that is useful for input/output casting.
Since this is needed for fp8 and DTensor is still under prototype release, I feel the change is worth it, and it's better we make the change early.
Right now this is a soft BC break, which means we maintain BC but throw deprecation messages. A sketch of the new-style hooks follows.
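A sketch of the new-style hooks (the exact signatures are an assumption based on the description above): the module is now passed to input_fn/output_fn so the hooks can read attributes off it, e.g. an fp8 emulate flag.
```python
from torch.distributed._tensor import distribute_module

def input_fn(module, inputs, device_mesh):
    # inspect `module` attributes here before casting/redistributing inputs
    return inputs

def output_fn(module, outputs, device_mesh):
    return outputs

# Assumes `model` is an nn.Module and `mesh` is an initialized DeviceMesh.
dist_model = distribute_module(model, mesh, input_fn=input_fn, output_fn=output_fn)
```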
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
Summary:
torch.testing.assert_equal doesn't support nested strided tensors because `sizes` is not implemented.
This adds special handling for nested tensors by checking for nested tensors and unbinding them if they are found.
Test Plan: test_trace_with_nested_strided_tensor_output
Differential Revision: D54430238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121039
Approved by: https://github.com/YuqingJ
Summary:
1) As a follow-up to D53602514: found a new way to decompose mm in the backward pass, i.e. sum the permuted input and reduce along the 0 dim. Some benchmark results: P1190140001 (30x speedup).
Some explanation of why the original mm decomposition is slow: for an mxkxn mm, when m is small and k is large, the stride of the lhs is [m,1], hence it needs to access memory k times to load all the data. As a result, the decomposition is effective with permute since the stride becomes [k,1].
2) Add another pattern for large k. Benchmark result: P1190596489 (28x speedup).
3) Fix the value-not-found error in ig ctr. f536115499
Test Plan:
pt2 decompose:
{F1462894821}
decompose: f536159404
baseline: f536282578
705k vs 725k 4% for ig ctr
Differential Revision: D54294491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120933
Approved by: https://github.com/mengluy0125
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121052
Approved by: https://github.com/jansel
ghstack dependencies: #121053
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)
    return x
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121053
Approved by: https://github.com/jansel
Summary:
Expose an option to users to specify name of the LogsSpec implementation to use.
- Has to be defined in entrypoints under `torchrun.logs_specs` group.
- Must implement LogsSpec defined in prior PR/diff.
Test Plan: unit test+local tests
Reviewed By: ezyang
Differential Revision: D54180838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.
**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```
**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms
After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms
Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
Here are the underlying rationales why we are going through these op replacements:
1. `nn.functional.cross_entropy` is the common method that OSS users use for things like transformer training; to avoid changing user code, we want users to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor now supports those ops already (#117723, #119255, #118917, #119256). They do the computation with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, to run the sharded computation efficiently we need to run both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we just override these two ops, so we need some way to **decompose** these two ops into smaller ops so that collectives can run in the middle of them.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Recently some users also reported similar issues: https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here.
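A minimal usage sketch of the context manager described above (the import path and the sharded setup are assumptions; `logits` is sharded on the class dimension, `labels` is the target tensor):
```python
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

# Inside loss_parallel, cross_entropy runs the sharded computation with the
# necessary collectives inserted between the decomposed ops.
with loss_parallel():
    loss = F.cross_entropy(logits, labels)
    loss.backward()
```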
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
Fix https://github.com/pytorch/pytorch/issues/120545. The reason these models fail the accuracy test with freezing is the conv-batchnorm fusion, which causes relatively large numerical differences.
For the failed TIMM models, raising the tolerance to `8 * 1e-2` makes the tests pass.
For the failed TB models, the numerical difference is too large. After a discussion with @eellison, we decided to skip them with freezing for now.
On the other hand, we probably should dig into why the conv-bn fusion causes such a large numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
Summary:
## Context
Introduce some simple bug fixes to the Vulkan Compute API that were causing errors on Android.
1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.
Test Plan:
Build the test binary for Android:
```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```
Push and run the test binary on a local android phone.
Differential Revision: D54425370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
Fixes https://github.com/pytorch/pytorch/issues/120645
`_internal_new_from_data` calls `_recursive_build`, but we run into errors such as in the cases below.
```
Failed running call_function <function tensor at 0xDEADBEEF>:
scalar_tensor(): argument (position 1) must be Number, not FakeTensor
# e.g. cases
1. [FakeTensor(..., size=(20, 1), dtype=torch.float64), ..., FakeTensor(..., size=(20, 1), dtype=torch.float64)]
- Here, we call _recursive_build(sizes=[4] ...) which hits the base case `if dim == ndim:` in the 2nd level of recursion.
- So, we try to return `scalar_tensor(FakeTensor)`
2. [[(FakeTensor(..., size=(1,), dtype=torch.int64), FakeTensor(..., size=(), dtype=torch.int64)]]
# side note: when can size = ()? Probably from scalar_tensor.
>>> torch.scalar_tensor(1).shape
torch.Size([])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120872
Approved by: https://github.com/ezyang
The test has been failing sporadically in CI recently and the failures are not reproducible locally, likely due to some nasty race condition related to a combination of MultiThreadedTestCase, the use of global state and finalizers, and the recently introduced test decorator for native funcol migration.
Switching the test to use MultiProcessTestCase to provide better isolation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.
`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
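A brief usage sketch of the fused path (assumes `model` holds DTensor parameters produced by fully_shard):
```python
import torch

total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
```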
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).
To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
Summary:
When there are multiple PGs in a process and a hardware failure happens, we found that multiple PGs/threads in the same process compete to dump the same records at the same time. This affects the reliability of dumps.
In this PR, we make the change such that only one thread/PG could dump: PG0's monitor thread. We use a static variable to indicate that something (e.g., a collective timeout) has triggered the dump locally.
The monitor thread dumps debug info under any one of these 3 conditions:
1. the static variable is set to true by the watchdog thread when it detects a timeout or a pipe dump signal
2. a timeout signal is received from other ranks through TCPStore
3. no heartbeat from the watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
Give TD its own job so that each shard can get the results from this one job artifact; they will always be in sync with each other and we no longer need to worry about consistency issues.
* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
* Cannot do cpp test discovery before building pytorch
* Move TD calculation to its own file that will create a JSON file with the final results
* TD is now job/build-env agnostic
* TD will rank all tests, including those that test jobs may not want to run (e.g. it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
TestCustomOp's tests use helper attributes and functions from a util parent class. To support arbitrary test classes, we need to refactor the current approach. Instead of allowlisting certain methods, we can copy the whole class and only overwrite the "test_.*" methods.
Compiled autograd fails on ~10/90 of the newly added tests. test_autograd_function_backed_op is the example we discussed in PT-2D meeting about requiring c++ autograd::Function support. I'm addressing this in #120732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120679
Approved by: https://github.com/jansel, https://github.com/zou3519
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects
Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
A user-defined `Mapping` type may contain some metadata (e.g., pytorch/tensordict#679, https://github.com/pytorch/pytorch/pull/120195#issue-2141716712). Simply using `type(mapping)({k: v for k, v in mapping.items()})` does not take this metadata into account. This PR uses `copy.copy(mapping)` to create a clone of the original collection and iteratively updates the elements in the cloned collection (see the sketch after the references below). This preserves the metadata of the original collection via `copy.copy(...)` rather than relying on the `__init__` method of the user-defined class.
Reference:
- pytorch/tensordict#679
- #120195

Closes #120195
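A sketch of the copy-based rebuild described above (the helper name is hypothetical, not from the PR):
```python
import copy

def remap_values(mapping, fn):
    out = copy.copy(mapping)   # preserves metadata carried by Mapping subclasses
    for k, v in mapping.items():
        out[k] = fn(v)         # update elements on the cloned collection
    return out

print(remap_values({"a": 1, "b": 2}, lambda v: v * 10))
```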
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120553
Approved by: https://github.com/vmoens
Summary: Change to call `get_buffer` on the input `plain_graph_module` instead of the new `stateful_gm` when restoring non-persistent buffers, since the `stateful_gm` doesn't contain the buffer yet.
Test Plan:
Added test case.
`buck test caffe2/test:test_export -- test_unlift_nonpersistent_buffer`
Differential Revision: D54216772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120715
Approved by: https://github.com/zhxchen17
Summary: Today we don't allow free functions to be the tracing callable in torch.export. As part of migrating capture_preautograd_graph usages to torch.export, we need to ban free functions in capture_preautograd_graph as well.
Test Plan: CI
Differential Revision: D54319597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
When a dynamo backend captures the entire forward pass and the entire backward pass without a graph break, there can be many `contiguous` calls (from memory, hundreds or thousands for a big model). We can save that overhead by checking `is_contiguous` before the `contiguous` call, as sketched below.
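A small sketch of the guard described above (the helper name is hypothetical):
```python
import torch

def maybe_contiguous(t: torch.Tensor) -> torch.Tensor:
    # Skip the numerous `contiguous` calls when the tensor already is contiguous.
    return t if t.is_contiguous() else t.contiguous()
```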
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118730
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.
Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly
TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
Summary: Previously, we relied on the `lark`-based parsing of the string TTIR representation dumped by the Triton compiler. However, this has proven to be brittle in the face of changes both in the user-written Triton kernel code and in the Triton compiler code.
In this PR, we add an alternative way of mining the function information from the TTIR based on walking the tree of structured MLIR entities. To this end, we rely on the MLIR bindings exposed by `libtriton` (related PR in Triton: https://github.com/openai/triton/pull/3191).
For now, we introduce gating based on whether `ttir_module.hasattr("walk")`. This will allow switching to the newly introduced TTIR analysis approach only when the new MLIR bindings (including that of `ModuleOp::walk`) become available in the Triton pin. Before then, we'll keep using the old string TTIR parsing-based approach.
Test Plan: The new functionality was tested locally with the latest Triton version compiled with the added new MLIR bindings: all Triton kernel mutation tests in `test_triton_kernels.py` are passing. Here we rely on the CI for regression testing, but it won't cover the new functionality due to gating.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120476
Approved by: https://github.com/oulgen
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.
This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module. It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
Also added a unit test which fails before this change and passes after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward-declare VariableInfo, but VariableInfo was also being used in other nodes like PyNode, so it felt cleaner to do it this way.
Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.
These tests are sometimes expected to fail, because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK, this is okay because fake tensors will still work because we have special handling in fake_impls.py. The purpose of this PR is to update the xfails so they are correctly xfailing the failing tests.
Previously, I set these to xfail only for bfloat16, float16, and float32, but not float64; but this isn't really correct. Explanation below:
Scaled dot product attention (SDPA) has multiple implementations, including efficient_attention, flash_attention, and unfused attention. flash_attention supports fp16, bf16. efficient_attention supports fp16, bf16, fp32. unfused attention supports all dtypes.
efficient_attention and flash_attention implementations will fail the meta tests, but the unfused attention will not. Certain platforms may support none, both, or one of efficient_attention and flash_attention. Unfused attention will pass because it falls back to constituent ops which have registered meta kernels.
So: on CUDA, we have all 3 available: in bf16, fp16, fp32, we'll select one of the fused implementations (where this test will fail).
On ROCM, we don't have efficient_attention: so fp32 will use the unfused implementation, where the test will pass.
Fix in this PR:
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.
Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly
TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
Summary:
This is a reimplemented version of the FB specific code in https://www.internalfb.com/diff/D54230697
The new strategy is that we unconditionally install an FB handler on the trace_log logger (and always set the level to DEBUG). When the first log message is emitted, we check the JK/filesystem to see if we should actually do logging. If we decide not to log, we remove the handler from trace_log and are done.
build_only[github-export-checks,executorch,pytorch_benchmark,pytorch_quantization,pytorch_distributed,pytorch_distributed_gpu,pytorch_dynamo_inductor,pytorch_functorch,pytorch_fx2trt,pytorch_diff_train_tests_ads,glow_fb_pytorch_tests,training_platform,training_platform_compatibility,training_toolkit_applications,training_toolkit_examples,training_toolkit_model_optimization,dper3_pytorch,xplat_caffe2,pytorch_dev,android-pytorch-instrumentation-tests,smartpytorchgithub_first_try_merge,frl-target-determinator,f6-buck,training_platform_for_github,sigmoid_cpu,sigmoid_gpu,aiplatform_modelprocessing_for_github,accelerators_workloads_models_slimdsnn,ae_aotinductor_benchmark_test,aps_,aps_deterministic_ne_tests,dper_lib_silvertorch,torchrec,torchrec_fb,deeplearning_aot_inductor]
Test Plan:
sandcastle
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/inference/tests:test_single_gpu_executor -- --exact 'torchrec/inference/tests:test_single_gpu_executor - TorchDeployGPUTest.NestedModelSingleGPU'
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test -- --exact 'dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test - test_global_fixed_interval_accumulator (dper_lib.silvertorch.modules.dynamic_stats.tests.accumulators_test.GlobalFixedIntervalUnivalentAcculumatorTest)'
```
Also running a test flow with/without JK enabled
Differential Revision: D54275086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120915
Approved by: https://github.com/yanboliang
A global transform list is created. All backend instances call the transform functions in that list sequentially to modify the exported ONNX model before sending the model to the ORT session. For example, `record_onnx_model_transform` below is a no-op transform that only records the ONNX graphs sent to ONNXRuntime.
```python
import torch

recorded_models = []

def record_onnx_model_transform(onnx_model):
    # Record the ONNX model seen by the transform.
    recorded_models.append(onnx_model)

from torch.onnx import (
    register_backend_graph_transform,
    unregister_backend_graph_transform,
)

# Register so that the `onnxrt` backend calls it to modify the ONNX model.
register_backend_graph_transform(record_onnx_model_transform)

def example_model(x: torch.Tensor):
    y = torch.sigmoid(x)
    z = x + y
    return z

# During compilation, the exported ONNX model will be
# modified by calling `record_onnx_model_transform` before
# sending the model to `onnxruntime.InferenceSession`.
compiled_model = torch.compile(
    example_model,
    backend="onnxrt",
    dynamic=True,
)

# Now, `recorded_models` should contain one `onnx.ModelProto` representing
# `example_model(x: torch.Tensor)`.
# Remove the transform when it is no longer needed. If `record_onnx_model_transform`
# is not unregistered, it will be applied to all models compiled with `backend="onnxrt"`.
unregister_backend_graph_transform(record_onnx_model_transform)
```
In the future, we plan to use this mechanism to register all graph transforms, such as graph fusion and general ONNX optimization, for `onnxrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120854
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
Fixes https://github.com/pytorch/pytorch/issues/120441
We follow how triton_kernel_wrapper_functional gets re-inplaced:
- If we see auto_functionalized, then first we compute what inputs we
actually need to clone ("tensors_to_clone") and fixup the graph. This happens in
`reinplace_and_refine_tensors_to_clone`, which I have refactored out
of the triton_kernel_wrapper_functional reinplacing code.
- Later on, after the reinplacing pass, we have a decomposition pass for
auto_functionalized. In that decomposition pass, we make use of the
"tensor_to_clone" info and only clone those inputs in the
decomposition.
- We shepherd "tensor_to_clone" from the first step to the second step
by setting the .meta field on the auto_functionalized node (see the sketch of a mutating custom op below).
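For context, a minimal sketch of the kind of mutating custom op that gets wrapped in `auto_functionalized` when compiled; this uses the `torch.library.custom_op` decorator from newer PyTorch releases purely as an illustration, and the reinplacing pass described above decides whether the mutated input actually needs a clone:
```python
import torch

# A custom op that mutates its first argument in place.
@torch.library.custom_op("mylib::inplace_scale", mutates_args={"out"})
def inplace_scale(out: torch.Tensor, scale: float) -> None:
    out.mul_(scale)

@torch.compile(fullgraph=True)
def f(x):
    y = x.sin()
    # Traced as auto_functionalized(mylib::inplace_scale, out=y, scale=2.0);
    # the decomposition only clones `y` if the reinplacing pass says it must.
    inplace_scale(y, 2.0)
    return y

f(torch.randn(4))
```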
Test Plan:
- existing tests
- tested locally by reading the output of TORCH_LOGS="post_grad_graphs"
- added assertExpectedInline tests for the post_grad_graphs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120829
Approved by: https://github.com/oulgen
# Note: Returning Fake Tensors on First AOT Autograd Call
#
# Inductor will optimize strides of outputs when it deems it profitable.
# For instance, converting to channels last. When we split the graph here
# into multiple inductor compilations, we need to make sure that the
# output strides of one compilation is appropriately passed to the subsequent
# compilations. However, the mapping from inductor output to dynamo output
# is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
# subclass handling, etc. In order to replay all this logic we set a flag such that
# the first invocation of inductor in aot_autograd will return Fake Tensors with
# appropriate strides. Then, all of aot autograd's runtime logic is replayed.
# This gives us the appropriately strided outputs here which will reflect runtime strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
This PR adds tests to test distributed state dict work properly for FSDP2's model and optimizer state_dict after a full training loop.
We test the combination of these options on an evenly sharded model.
```
{
    "reshard_after_forward": [True, False],
    "optimizer_class": [torch.optim.Adam],
    "compile_model": [True, False],
},
```
Followup: 1. Add test for unevenly sharded model. 2. Add test to include `torch.optim.AdamW` (seems to have some gaps currently, still investigating)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120871
Approved by: https://github.com/fegin
Summary:
Pulling out logging parameters into a logging specs that can be overridden (follow-up changes on possible mechanism)
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json
In some instances, users would like to customize this behavior, including the file names, based on context. We do already have a mechanism to template the multiplexed teed output prefix.
With current changes, users can create custom log spec that can use env variables to change the behavior.
Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
This is basically done the obvious way. For better or worse, I jammed this into what used to be `_maybe_guard_eq` but now is `_maybe_guard_rel`. I was careful to test all the off by one conditions, and each permutation. Let me know if you think I missed anything. Importantly, this now works for unbacked SymInts.
While testing, I noticed we are silently duck sizing all symbolic variables in `test_dynamic_shapes.py`. This may or may not be covering up bugs.
Along the way, I had to fix a bug in export constraints, where we weren't checking that the final var_to_range was consistent with what the user requested at top level.
After I implemented all this, I realized that applying this to non-unbacked SymInts was duplicative with @ysiraichi's previous work on https://github.com/pytorch/pytorch/pull/97963 . The upside is I now understand what Yukio was trying to do in the original PR, and I think my new logic is simpler and less error prone. In Yukio's earlier diff, Yukio tried very hard to avoid changing what guards we actually issue (since this would cause tests to wobble). Thus, when he refined a range, he also saved the guard that actually caused the range to refine. In this PR, I don't bother saving these guards; instead I just tighten var_to_range directly and rely on generating guards on this to be correct. The key insight is that if I assert `x < y`, it's always safe to emit (potentially) more restrictive range guards, because this won't invalidate our guards, it will just make them a little too strong (but actually, I think we are precise along the way.) If these guards make it unnecessary to test `x < y`, because now the ranges for x and y are disjoint, this is fine, we've subsumed the x < y guard and can just not bother testing it. If I've gotten it right, TV will agree with me.
In fact, I had a bug in this PR which TV didn't catch, which is that when we have a recorded var_to_guards for a symbol, we unconditionally never generate the range guard for it, even if var_to_guards is potentially inconsistent with var_to_range (because var_to_range was updated separately). With var_to_guards removed, I don't have to worry about this inconsistency.
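For illustration, a minimal sketch of the kind of relational assertion on an unbacked SymInt that this change lets the range analysis absorb; the exact guards emitted are an implementation detail, and `torch._check` is just one way to introduce such a relation:
```python
import torch
import torch._dynamo.config as dynamo_config

# Needed so that .item() produces an unbacked SymInt instead of a graph break.
dynamo_config.capture_scalar_outputs = True

@torch.compile(fullgraph=True, dynamic=True)
def f(x, n):
    u0 = n.item()                  # unbacked SymInt
    torch._check(u0 >= 0)          # refines u0's lower bound
    torch._check(u0 < x.size(0))   # relational guard; also tightens u0's upper bound
    return x.narrow(0, 0, u0)      # bounds checks can be discharged from the refined ranges

f(torch.randn(10), torch.tensor(3))
```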
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120800
Approved by: https://github.com/Skylion007, https://github.com/avikchaudhuri, https://github.com/ysiraichi
Summary:
When we convert `dynamic_shapes` to `constraints` and pass them to `_dynamo.export`, we shouldn't give a deprecation warning. Such conversion happens when calling `torch.export.export`, e.g. But it can also happen when calling `capture_pre_autograd_graph` (which itself has this deprecation warning when `constraints` are passed directly as well).
Since `_log_export_usage` is an indicator of a top-level call (it is `True` by default but set to `False`, or at least passed through, by callers), we can (ab)use it to indicate when to give this deprecation warning.
Test Plan: none
Differential Revision: D54350172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120896
Approved by: https://github.com/BoyuanFeng, https://github.com/zhxchen17
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut, which makes the top 25% still take really long (90+ min).
The original plan was to test this on rocm, but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min. Crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.
Results are logged as text files with one JSON document corresponding to a single benchmark result per line.
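For example, a log in that format can be loaded for analysis with a few lines of Python (the file path and field name here are hypothetical):
```python
import json

# Hypothetical path to one of the produced text files; each line is one JSON
# document describing a single autotuning benchmark result.
with open("autotune_results.jsonl") as f:
    results = [json.loads(line) for line in f if line.strip()]

# e.g. sort candidate kernels by measured latency, assuming such a field exists.
results.sort(key=lambda r: r.get("benchmark_result", float("inf")))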
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119004
Approved by: https://github.com/jansel
ghstack dependencies: #120620
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" does not work correctly, e.g.
```
$ cmake --version
cmake version 3.26.4
$ cat CMakeLists.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)
if(NOT ENV{ROCM_SOURCE_DIR})
message(INFO ": not defined 1")
else()
message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()
if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
message(INFO ": not defined 2")
else()
message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```
This PR replaces it with a STREQUAL check. Note that STREQUAL was chosen so that it also handles cases like:
```
$ ROCM_SOURCE_DIR= cmake .
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
Summary:
Previously `export` would take `constraints` built with `dynamic_dim(...)`s. This has been deprecated for a while; one can now pass in a `dynamic_shapes` spec built with `Dim(...)`s.
Here we kill this deprecated API. Eventually this will lead to simplification of the underlying implementation, since the new `Dim`-based specs can map 1-1 with symbolic shapes concepts without going through indirect machinery of `dynamic_dim`-based constraints. It is expected that internal APIs like `_dynamo.export` and `_trace._export_to_torch_ir` will change when that happens.
Leaving `aot_compile` and `capture_pre_autograd_graph` entry points alone for now. This will eventually be updated anyway.
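For reference, a minimal sketch of the `Dim`-based `dynamic_shapes` spec that replaces the deprecated `dynamic_dim` constraints (the module and dimension names here are only illustrative):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=1)

batch = Dim("batch", min=2, max=1024)
# Instead of constraints built with dynamic_dim(...), pass a per-input spec.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
```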
Test Plan: updated tests
Differential Revision: D54339703
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120860
Approved by: https://github.com/suo, https://github.com/tugsbayasgalan
Summary: Currently the logic for filling in the default values for optional arguments is scattered in several places. By storing the OpOverload in the base ExternKernel class, we can simplify codegen_kwargs; this is also a preparation step for enabling the torchgen-ed C shim. The default value filling logic for FallbackKernel can also be simplified, but that can come later.
Differential Revision: [D54258089](https://our.internmc.facebook.com/intern/diff/D54258089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120629
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987, #120592
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with the backend id. Instead, using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
1. This moves TRITON_CONSTRAINT to the common binary_populate_env.sh so that it is set for all wheels. Tested in CI via the ``ciflow/binaries`` label. Note that we only set this constraint when PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set; that variable is set for all wheels that get uploaded to PyPI, hence the triton constraint needs to be set in the same place.
2. This is done for regular wheels and ROCm wheels separately, since ROCm wheels use a different triton package.
3. Cleanup of legacy unused code.
Test:
``
git grep setup_linux_system_environment.sh
``
Needs: https://github.com/pytorch/builder/pull/1712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120744
Approved by: https://github.com/huydhn
scaled_gemm for ROCm using hipblaslt. As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported. A work-around is provided, performing the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
Summary: In non-strict mode of torch.export(), we didn't set those `is_compiling()` flags to `True`, which is needed by some models.
Test Plan: Unit tests and manual testing.
Differential Revision: D53624452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.
Fixes#120478. The repro from the issue, on A100:
Before this PR:
```
Triton matmul: 0.0167 seconds
Triton matmul compiled: 0.0751 seconds
```
After this PR:
```
Triton matmul: 0.0168 seconds
Triton matmul compiled: 0.0072 seconds
```
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
Summary: Previously, in `memory_plan_reuse` we assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case. This caused bugs where buffers were reused across visibility boundaries in different nested scopes.
In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting scope in the codegen. This restricts the buffer reusability only to the currently active (peak) scope / planning state.
Test Plan:
```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
Previously, we parametrized some tests to run with both the native and the py funcol by flipping a global variable. However, some of these tests are multi-threaded, and this parametrization mechanism could lead to race conditions.
This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.
In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.
For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop has led to different side effects in the model. Therefore, to the nested subgraphs, we only apply a subset of `pre_grad` passes that doesn't require example inputs.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
Fix and test issues with both coalesced and individual send/recv ops
Considered an alternate approach and then ditched it
- alternate approach: #119757
- reason ditched: prefer recording individual collective events inside
coalescing region instead of just the event at the end of the region,
which also would not have tensor sizes or opnames without additional
state variables added
Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
recording when recording in workEnqueue. Adding the info onto the
Work obj would be possible, but adds to overhead of copying Works
which we do on every collective. We can get info off the input/output
tensors directly in initWork, but we don't want to keep refs to those
tensors alive while the work is Enqueued, so we'd have to specifically
copy size lists or something.
This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.
It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().
The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for the first op (but this is inconsistent)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.
Note: the records are already in a list, so their ID could be implicitly
observed. But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing; we propose to skip the dynamic batch size testing of these 3 models in this PR.
* Error msg is
```
File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```
* Root Cause is
* The benchmark code only annotates an input's dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
* However, for these 3 models, none of the input dims will equal the input batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
```
shape = (
    math.ceil(2 * size ** (1/3)),
    math.ceil(2 * size ** (1/3)),
    math.ceil(0.25 * size ** (1/3)),
)
```
* Another thing is that `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing, because the batch size has been set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.
* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, it looks like this is not expressible today. So, I think we need to skip the dynamic batch size testing for these 3 models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
Previously, torch.export in non-strict mode was failing on str inputs in two stages: while creating fake inputs for tracing (fakify()), and while using graph nodes to create constraints. This fixes both stages to allow strs to pass through.
Failing test case:
```
class Foo(torch.nn.Module):
    def forward(self, a, b, mode):
        return torch.div(a, b, rounding_mode=mode)

foo = Foo()
inps = (torch.randn(4, 4), torch.randn(4), "trunc")
exported = export(foo, inps)
with self.assertRaisesRegex(
    RuntimeError, "to be equal to trunc, but got floor"
):
    _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```
Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
method(*args, **kwargs)
File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
exported = export(foo, inps)
^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
return export(*args, **kwargs, strict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
return _export(
^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
raise e
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
ep = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
) = make_fake_inputs(f, args, kwargs, constraints)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
fake_args, fake_kwargs = tree_map_with_path(
^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
leaves = list(leaves)
^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input
To execute this test, run the following from the base repo dir:
python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.008s
FAILED (errors=1)
```
After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.
This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module. It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
Also added a unit test which fails before this change and passes after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.
This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors
Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
Summary: Vulkan gtests were segfaulting on Mac because the memory for barriers can get destroyed once the local function (CommandBuffer::insert_barrier) that creates it exits. Since we provide this barrier pointer to the Vulkan library, it needs to stay alive even after the function exits; otherwise we get crashes.
Test Plan:
See that there is no segfault on mac with fix and tests can run:
Compile gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
Crash w/o diff
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11
With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN ] VulkanAPITest.uniform_buffer_copy
[ OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[ FAILED ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)
[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[ PASSED ] 72 tests.
[ FAILED ] 13 tests, listed below:
[ FAILED ] VulkanAPITest.linear_2d_flat
[ FAILED ] VulkanAPITest.linear_2d_small
[ FAILED ] VulkanAPITest.linear_2d_large
[ FAILED ] VulkanAPITest.linear_3d_flat
[ FAILED ] VulkanAPITest.linear_3d_small
[ FAILED ] VulkanAPITest.linear_3d_large
[ FAILED ] VulkanAPITest.linear_4d_flat
[ FAILED ] VulkanAPITest.linear_4d_small
[ FAILED ] VulkanAPITest.linear_4d_large
[ FAILED ] VulkanAPITest.gelu_qint8
[ FAILED ] VulkanAPITest.gelu_qint8_self
[ FAILED ] VulkanAPITest.gelu_quint8
[ FAILED ] VulkanAPITest.gelu_quint8_self
The above failing tests were failing before as well and are being worked on.
Differential Revision: D54023146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Test Plan: Sandcastle
Reviewed By: meyering
Differential Revision: D54163060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
With the current `Dim`-based dynamic shapes API for export, one can express that shapes of different input shapes must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.
Recently we are seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).
This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.
We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.
Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.
Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.
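As a rough illustration of the derived `Dim` syntax described above, here is a minimal sketch (the model and names are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        # Requires y.size(0) == 2 * x.size(0).
        return torch.cat([x, x], dim=0) + y

dim = Dim("dim")
ep = export(
    M(),
    (torch.randn(4, 8), torch.randn(8, 8)),
    # x's batch is `dim`, y's batch is the derived dim `2*dim`.
    dynamic_shapes={"x": {0: dim}, "y": {0: 2 * dim}},
)
```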
Differential Revision: D53254587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
This pull request provides an update on the recent advancements made in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.
# Motivation
The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.
# Principles
1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.
# Solutions
### a. Pathway Identification:
Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.
### b. `use_device` Logic Revision:
With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.
### c. Kernel List Segregation:
To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.
### d. Formatted Output:
To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
# Additional Enhancements
### a. Enumerations in `.pyi` Files:
Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.
### b. Correct DeviceType Returns:
Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.
### c. Bug Fixes in `cuda_corr_map`:
Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.
# Further Abstraction
Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.
# Next Pull Request
The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
When compiling a deserialized ExportedProgram, the constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is missing.
```
constants_info_[0].name = "L__self___w_pre";
constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
constants_info_[0].offset = 0;
constants_info_[0].data_size = 64;
constants_info_[0].from_folded = false;
constants_info_[0].shape = {4, 4};
constants_info_[0].stride = {4, 1};
// constants_info_[0].original_fqn = "w_pre"; // this line is missing
```
Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.
As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.
Differential Revision: D54097073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
Summary: FBCode CI does not compile torch with CUDA for tests in dynamo folder, instead of adding a special rule, lets move these tests to inductor folder.
Test Plan:
```
buck run mode/opt //caffe2/test/inductor/:triton_kernels
```
now works instead of skipping tests
Differential Revision: D54280629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120738
Approved by: https://github.com/aakhundov
It's a bit annoying to have to read through the test name in verbose mode just to see what the test's sentinel file is actually called when encountering an unexpected success. Now that we have sentinel files, we can directly list the file path from root in the error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120766
Approved by: https://github.com/Skylion007
Summary: Expose sequence number to work info. The number can help applications identify a NCCL work more precisely.
Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest
Differential Revision: D54180050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
Summary:
We observed that stack nodes have missing example_value in DPA+FIRST, which blocks further split cat optimization. Full error log: P1187633689.
pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz
We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.
Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```
after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```
Differential Revision: D54140488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120655
Approved by: https://github.com/jackiexu1992
Summary:
We ran into the following import loop when testing aps:
```
Traceback (most recent call last):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
_fixup_main_from_name(data['init_main_from_name'])
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
main_content = runpy.run_module(mod_name,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
return _run_module_code(code, init_globals, run_name, mod_spec)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
class ICVRConfig(AdsComboLauncherConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
class AdsComboLauncherConfig(AdsConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
class AdsConfig(RecTrainAppConfig):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
class EmbeddingKernelConfig:
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
class ParameterSharding:
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
sharding_spec: Optional[ShardingSpec] = None
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
class ShardingSpec(ABC):
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
tensor_properties: sharded_tensor_meta.TensorProperties,
File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
```
Using future annotations to mitigate.
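A minimal, self-contained sketch of the mitigation, modeled on the `ParameterSharding` dataclass in the traceback (names are only illustrative):
```python
from __future__ import annotations  # annotations become strings, evaluated lazily

from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only imported for static type checking; skipped at runtime, which breaks
    # the circular import with the sharding-spec module.
    from torch.distributed._shard.sharding_spec import ShardingSpec

@dataclass
class ParameterSharding:
    # With postponed evaluation, "Optional[ShardingSpec]" is never resolved at
    # class definition time, so the partially initialized module is not needed.
    sharding_spec: Optional[ShardingSpec] = None
```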
Test Plan:
```
hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep
```
Differential Revision: D53685582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119820
Approved by: https://github.com/fegin
Shorthand for `"%(levelname)s:%(name)s:%(message)s"` which is hard to
remember.
I find the default formatter annoying since just the metadata fills up
most of the width of my terminal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120757
Approved by: https://github.com/ezyang
This adds some initial unit tests for FSDP2 model state dict only.
This PR adds two tests:
1. Add a unit test checking parity between FSDP `model.state_dict()` and distributed_state_dict's `get_model_state_dict`.
2. Add a unit test to make sure `StateDictOptions(full_state_dict=True, cpu_offload=True)` in distributed_state_dict works for the FSDP2 model state_dict.
Optimizer state dict will be in follow up PRs.
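For reference, a minimal sketch of the distributed_state_dict calls being tested, meant to run inside an initialized distributed job with an FSDP2-wrapped `model` (names here are illustrative):
```python
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_model_state_dict,
)

# Sharded (default) model state dict.
sharded_sd = get_model_state_dict(model)

# Full, CPU-offloaded model state dict.
full_sd = get_model_state_dict(
    model,
    options=StateDictOptions(full_state_dict=True, cpu_offload=True),
)
```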
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120680
Approved by: https://github.com/awgu
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
This updates the nesting of if statements in `nn.Module._apply` such that if
`torch.__future__.set_swap_module_params_on_conversion(True)`, we always try to swap regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param)`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set
This means that `meta_module.to_empty('device')` can now use the swap_tensors path cc @awgu
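A minimal sketch of the path this enables, assuming the `torch.__future__` flag described above:
```python
import torch
import torch.nn as nn

# Opt in to swapping whole tensors (instead of assigning .data) during module
# conversion, as discussed above.
torch.__future__.set_swap_module_params_on_conversion(True)

# Build a module on the meta device, then materialize it; with the flag set,
# to_empty() now goes through the swap_tensors path.
with torch.device("meta"):
    meta_module = nn.Linear(4, 4)

meta_module.to_empty(device="cpu")
```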
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
cuBLAS has indicated that certain kernels will transition to using the driver API over the CUDA runtime API, which we've observed to break existing tests (e.g., DataParallel) that use multithreading and may not eagerly grab a context via `cudaSetDevice`.
CC @Aidyn-A @ptrblck
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120131
Approved by: https://github.com/atalman
Tacotron2 causes massive loop unrolling resulting in very large graphs (26k nodes) which was causing inductor (and tracing itself) to choke.
The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.
This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s ->1m
For reference the big loop in tacotron2 was this one (model.py[405]):
```
while len(mel_outputs) < decoder_inputs.size(0) - 1:
    decoder_input = decoder_inputs[len(mel_outputs)]
    mel_output, gate_output, attention_weights = self.decode(decoder_input)
    mel_outputs += [mel_output.squeeze(1)]
    gate_outputs += [gate_output.squeeze(1)]
    alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.
Fixes#98467
Relates to #102839 which hopefully will result in a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
This PR refactors the tuple strategy handling logic, and allow
TupleStrategy to have both input/output specs for each OpStrategy child,
so that we could further enable operators like foreach norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
Summary: The process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now we expose this information in Kineto as well.
Differential Revision: D53557965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
Summary:
When we export with strict mode on and turn on preserve_module_call_signature, the following assertion error will occur today:
```
child_split[: len(parent_split)] == parent_split
```
This is because we're monkey patching the forward call directly, which breaks attribute propagation in the tracer. It's actually better to implement this using a forward hook, because then we don't have to alter the original module structure at all during export.
Test Plan: CI
Differential Revision: D54102714
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120468
Approved by: https://github.com/ydwu4
In this PR stack, there were unrelated test failures within test_trace_rules.py - it turned out that torch.cuda._get_device_properties should be registered in _dynamo/trace_rules.py, and a test failed because it was not.
This is a small fix which tries to get rid of the test failure by manually registering that function.
Note:
I am not sure whether this is the best way to fix this, as I am neither familiar with the trace rules nor with the introduction of torch.cuda._get_device_properties.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120620
Approved by: https://github.com/Skylion007
Currently, when loading with strict=False, or with strict=True and looking at the error message, FQNs are garbled with FSDP details such as "_fsdp_wrapped_module".
This makes it tricky for upstream applications to validate which keys are expected to be missing / unexpected (for example with PEFT, where the state_dict is loaded non-strictly), and makes the error message more complicated with FSDP details.
This PR cleans those prefixes by using `clean_tensor_name` in FSDP's existing
post load_state_dict hooks. Currently, only full_state_dict impl is tested, can test the rest of the impls as follow up work.
Differential Revision: [D54182472](https://our.internmc.facebook.com/intern/diff/D54182472/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120600
Approved by: https://github.com/XilunWu, https://github.com/fegin
Fixes#115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1` (see the sketch below).
- Updated the `ArgumentInfo` struct, as it hardcodes the device index as an 8-bit field [^1]. Might be a breaking change; not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
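A minimal sketch of the user-visible change described above (the exact exception type and message are assumptions, not taken from the implementation):
```python
import torch

# Previously, the index was silently truncated to an int8_t, so
# torch.device('cpu', 200).index came back as -56.
# With the widened DeviceIndex, 200 is simply a valid index:
print(torch.device('cpu', 200).index)  # 200

# Values outside [0, MAX_NUM_DEVICES - 1] are now rejected by the bounds check.
try:
    torch.device('cpu', 100_000)  # hypothetical out-of-range value
except Exception as err:          # exact exception type may differ
    print(err)
```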
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
Summary:
Previously we were renaming constants to `lifted_constant_tensor0` or equivalent. This PR changes things so that the constants retain the same FQN as in the original eager module.
Actually, `symbolic_trace` is already supposed to do this, but the code path is not triggered when used from `make_fx`, since we don't pass an actual `nn.Module` instance to `trace()`, but rather a multiply-wrapped, functionalized lambda-thing.
So, I reproduced the essential logic outside of make_fx, at the export layer.
Test Plan: added a unit test
Differential Revision: D54221616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120664
Approved by: https://github.com/SherlockNoMad
Yuzhen Huang was complaining to me that searching for `__recompile`
no longer works. This is because the glog format uses the filename, not
the logger name, so we lost the artifact name. Add it back.
Looks like:
```
V0226 15:56:04.142000 139828992779264 torch/_dynamo/guards.py:1084] [0/2] __guards: ___check_type_id(L['inputs'], 7626144)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120671
Approved by: https://github.com/Skylion007
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.
Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they are covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).
Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
Reduces backend=eager compile time from 33 to 19 seconds for `MobileBertForQuestionAnswering`. This also helps an internal model where the guards.add function was taking 124 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120520
Approved by: https://github.com/mlazos
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the last runtime component we would like to upstream is `Generator`, which is responsible for pseudo-random number generation. To facilitate code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `aten`.
# Design
Following the previous design, `c10::GeneratorImpl` is the device-agnostic abstraction of a random number generator. So we will introduce an XPU generator, `XPUGeneratorImpl`, inheriting from `c10::GeneratorImpl`, to manage random states on an Intel GPU device. The Intel GPU runtime `Generator` adopts the same algorithm as the CPU. The corresponding C++ files are placed in the aten/src/ATen/xpu/ folder and built into `libtorch_xpu.so`.
This PR provides the following APIs:
- `getDefaultXPUGenerator`
- `createXPUGenerator`
# Additional Context
The 2nd PR will cover `python frontend`.
The differences with CUDA:
- The generator-related ATen C++ APIs map 1:1 to CUDA's.
- `XPUGeneratorImpl`'s member functions have slight differences from CUDA's.
- XPU lacks counterparts for the CUDA-specific APIs listed below:
- capture_prologue
- capture_epilogue
- philox_cuda_state
- reset_rnn_state
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118528
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit
How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB-specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.
Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls on the fake tensors may result in GuardOnDataDependentSymNode errors.
https://github.com/pytorch/pytorch/issues/119587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
This does two things:
1) Short-circuit `welford_reduce` on the first iteration to ignore the accumulator (a big win for small `rnumel`)
2) Replace division with multiplication by the reciprocal
Currently this is not enough to match the two-pass reduction with bfloat16, but it is still a significant improvement.
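For illustration, a scalar Python sketch of a Welford reduction with the two tweaks applied (this illustrates the algorithmic idea, not the code Inductor actually generates):
```python
def welford_reduce(values):
    # Running mean / M2 / weight, as in Welford's online algorithm.
    mean, m2, weight = 0.0, 0.0, 0.0
    for i, x in enumerate(values):
        if i == 0:
            # (1) short-circuit the first iteration: seed the accumulator
            # directly from the first element instead of combining with an
            # empty accumulator
            mean, m2, weight = x, 0.0, 1.0
            continue
        weight += 1.0
        delta = x - mean
        mean += delta * (1.0 / weight)  # (2) multiply by the reciprocal
        m2 += delta * (x - mean)
    return mean, m2, weight  # variance = m2 / weight
```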
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
Convert from a list/bucket-based TD system to a purely numbers-based TD system. It looks like a massive change, but a decent amount of it is tests and code removal.
The main file of interest is interface.py, which GitHub collapses by default due to its size.
The test files pretty much got rewritten entirely, since a lot of the old tests are no longer relevant.
Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
Fixes #114831
Before:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $ python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
F
======================================================================
FAIL: test_torch_name_rule_map_updated (__main__.TraceRuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
method(*args, **kwargs)
File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 328, in test_torch_name_rule_map_updated
self._check_set_equality(
File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 302, in _check_set_equality
self.assertTrue(len(x) == 0, msg1)
AssertionError: False is not true : New torch objects: {<built-in method _print of type object at 0x7ff477e40ee0>} were not added to `trace_rules.torch_c_binding_in_graph_functions` or `test_trace_rules.ignored_c_binding_in_graph_function_names`. Refer the instruction in `torch/_dynamo/trace_rules.py` for more details.
To execute this test, run the following from the base repo dir:
python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.184s
FAILED (failures=1)
```
After change:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $ python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
.
----------------------------------------------------------------------
Ran 1 test in 0.209s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120533
Approved by: https://github.com/clee2000, https://github.com/yanboliang, https://github.com/huydhn, https://github.com/Skylion007
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32, the accuracy check passes.
Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing some small numerical differences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
Summary:
Fixes #120386
`_AllGather.backward` assumes that `_ReduceScatter` always updates the output buffer in place. However, when the output buffer is non-contiguous, `_ReduceScatter` allocates and returns a different buffer, causing the gradient to be thrown away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120582
Approved by: https://github.com/XilunWu
1) Use items stored in torch._tensor_classes to check items passed from the Python side;
2) Add SparsePrivateUse1 to backend_to_string, layout_from_backend, and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
Summary:
## Context
Move the Vulkan graph runtime from the PyTorch directory to the ExecuTorch directory to improve development logistics:
* ExecuTorch delegate changes will no longer require exporting to the PyTorch directory
* Makes it much easier to enable the OSS build for the Vulkan delegate
Test Plan:
```
LD_LIBRARY_PATH=/home/ssjia/fbsource/third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/executorch/backends/vulkan/test:vulkan_compute_api_test_bin
buck2 run fbcode//executorch/backends/vulkan/test:test_vulkan_delegate
```
Differential Revision: D54133350
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120528
Approved by: https://github.com/manuelcandales
The search code to find the dynamo skip files wasn't working properly when used with pytest and multiple files:
```
pytest a.py b.py
```
because pytest would point `__main__` at itself instead of the individual file. (This worked fine when running only a single test file.)
Change the scanning code to look for the skip directory relative to its own file first.
While in there, add/update some comments and log a warning when the directory isn't found (instead of a hard crash).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120521
Approved by: https://github.com/oulgen
For graphs with a single output, torch.export / torch.compile expects the graph_module output type to be a single torch.Tensor instead of a tuple.
However, after using the `_SplitterBase` partitioner on such a graph_module (obtained from torch.export/torch.compile), the resulting graph module returns a tuple of tensors, in this case `(output,)`.
This PR adds codegen to the graphs produced by the `_SplitterBase` partitioner. Setting this ensures pytree unflatten nodes are added automatically to handle unflattening of the output, so single outputs are returned directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
Summary:
The current dump timeout logic is a bit cumbersome, as it needs two times: 1. the
timeout, 2. the wake-up time. In theory the caller just needs to wait
for at most the timeout value for the dump and then declare the dump to be
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
Summary: We can only avoid decomposing CompositeImplicit custom ops that are functional. From the looks of the implementation, this op is functional, so the fix is just to fix the schema.
Test Plan: CI
Differential Revision: D54019265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
These ops are at the level of the OperatorRegistry from the previous change. All ExecuTorch ops will go here.
```
ATen/native/vulkan/graph/ops
```
They are not to be confused with the general ATen ops from `native_functions.yaml` that will continue to exist. All PyTorch ops are here.
```
ATen/native/vulkan/ops
```
To help think around this split, note that we can actually implement the latter ATen ops with the former OperatorRegistry ops, since it's currently a subset.
Differential Revision: [D54030933](https://our.internmc.facebook.com/intern/diff/D54030933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120364
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362, #120363
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file under the `skip` key.
This is an effort to let external consumers easily replicate the performance
and coverage results from the PyTorch HUD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
Found an error in the docs of `torch.linalg.lu_factor` related to `torch.linalg.lu_solve`. Also fixes a Sphinx issue along the way.
```Python traceback
TypeError: linalg_lu_solve(): argument 'LU' (position 1) must be Tensor, not torch.return_types.linalg_lu_factor
```
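A usage sketch of the intended pairing: unpack the named tuple returned by `lu_factor` and pass its fields to `lu_solve`, rather than the named tuple itself.
```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 2)

LU, pivots = torch.linalg.lu_factor(A)    # named tuple (LU, pivots)
x = torch.linalg.lu_solve(LU, pivots, b)  # pass the fields, not the named tuple
print(torch.allclose(A @ x, b, atol=1e-5))
```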
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120484
Approved by: https://github.com/lezcano
By hiding float32 constructors and exposing float16 ones. This allows the compiler to do implicit conversions as needed and, in safe cases, optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos)
```cpp
#include <arm_neon.h>
#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif
float16_t sum1(float16_t x, float16_t y) {
return x + y;
}
float16_t sum2(float16_t x, float16_t y) {
return static_cast<float>(x) + static_cast<float>(y);
}
```
both sum variants are compiled to a scalar fp16 add if built for a platform that supports fp16 arithmetic
```
sum1(half, half): // @sum1(half, half)
fadd h0, h0, h1
ret
sum2(half, half): // @sum2(half, half)
fadd h0, h0, h1
ret
```
Fixes a build error after #119483 in some aarch64 configurations that are defined as supporting FP16 but don't define _Float16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
Fixes #119543
- doc fixed so that `reduce` is a kwarg (see below for details)
- doc added another interface `(int dim, Tensor index, Number value, *, str reduce)` where
the full signature in the pyi file after build is
```
def scatter_(self, dim: _int, index: Tensor, value: Union[Number, _complex], *, reduce: str) -> Tensor:
```
This can be further verified in
02fb043522/aten/src/ATen/native/native_functions.yaml (L8014)
Therefore, the value can be an int, bool, float, or complex.
Besides the issue mentioned in #119543, `reduce` should be a kwarg, as shown below:
```
* (int dim, Tensor index, Tensor src)
* (int dim, Tensor index, Tensor src, *, str reduce)
* (int dim, Tensor index, Number value)
* (int dim, Tensor index, Number value, *, str reduce)
```
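A small usage sketch of the scalar-value overload with the keyword-only `reduce` argument:
```python
import torch

x = torch.zeros(3, 5)
index = torch.tensor([[0, 1, 2, 0, 1]])

# value overload: `reduce` must be passed as a keyword argument
x.scatter_(0, index, 2.0, reduce='add')
print(x)
```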
The test case for a scalar value is already implemented in
70bc3b3be4/test/test_scatter_gather_ops.py (L86)
so no additional test case is required.
@mikaylagawarecki @janeyx99
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120169
Approved by: https://github.com/mikaylagawarecki
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the state of PG progress,
but they were logged only when a timeout was detected.
We found this is not enough, since the 'straggler' itself might not detect
the timeout and hence there is no log from the 'straggler'.
In this PR, we log these states periodically so that it is
much easier for us to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Log added; build succeeds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.
Verified new performance using pytorch/benchmarks/operator_benchmark.
```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```
On MI200 this improved performance by 3% on average.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.
Also marking a test affected by this with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
`nn.Module.__setattr__` does not actually call `super().__setattr__()`. If we make this call in our fast path, then we will inadvertently set the parameter as an actual attribute on the module, not just as an entry in the `_parameters` dict. This can lead to a bug where after replacing the parameters on the module (e.g. via `to_empty()` from meta device), we now have both an actual attribute (old) and a new entry in `_parameters` (new). Trying to access the parameter would give the old one since Python only resolves `__getattr__` if normal attribute lookup fails.
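A minimal standalone illustration of the shadowing problem (this mimics the effect of the bug rather than the actual fast-path code):
```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)
old = m.weight

# Simulate setting the parameter as a plain instance attribute, bypassing
# nn.Module.__setattr__ (what the buggy fast path effectively did).
object.__setattr__(m, "weight", old)

# Replace the registered parameter, e.g. as to_empty() would.
m._parameters["weight"] = nn.Parameter(torch.zeros(2, 2))

# Normal attribute lookup finds the stale instance attribute first, so the
# new parameter in _parameters is shadowed.
print(m.weight is old)                 # True
print(m._parameters["weight"] is old)  # False
```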
The bug was exercised in the following PR. I wanted to land this bug fix separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120340
Approved by: https://github.com/yifuwang
ghstack dependencies: #120231
Summary:
Decompose memory-bound mm/bmm.
Linear decomposition result: D53502768
BMM decomposition result: D53148650
We should only decompose when:
1) bmm: b is large and m, n, k are relatively small;
2) mm/addmm: m is large and n, k are relatively small. E.g., the mm for the input gradient in linear backward should not be decomposed, since m is small and n is large.
We need to conduct more experiments to see if we can find a better strategy for decomposition. I have tried to use a linear regression model (see the bento results), which does not fit well. For the short term, we use heuristics to determine when to decompose.
Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```
COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40
FIRST AFOC pyper mc1
Differential Revision: D53602514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125
Log a few more fields:
- num_atomic_add: the perf of kernels using atomic_add is usually data dependent. Our benchmarking code generates all indices as 0, which results in worse perf than in reality.
- kernel_args_num_gb: an estimate of the amount of reads/writes for the kernel args; in-place args are double counted. If we have a good estimate, this should be the lower bound of the memory access that the GPU performs. Sometimes the GPU does more memory access, since a single buffer may be accessed multiple times (e.g. for softmax when the input tensor is quite large; the cache only helps a bit here). With this logged, and if we augment the metadata with the amount of memory the GPU actually accessed, it would be nice to dig into kernels where the GPU accesses more memory (see the sketch below).
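A rough sketch of this kind of estimate (not Inductor's actual implementation; the function and parameter names are made up for illustration):
```python
import torch

def estimate_args_gb(args, inplace_args=()):
    # Sum bytes over tensor arguments; count in-place args twice (read + write).
    inplace_ids = {id(t) for t in inplace_args}
    total_bytes = 0
    for t in args:
        if isinstance(t, torch.Tensor):
            nbytes = t.numel() * t.element_size()
            total_bytes += 2 * nbytes if id(t) in inplace_ids else nbytes
    return total_bytes / 1e9
```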
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120274
Approved by: https://github.com/jansel
ghstack dependencies: #120266
Fixes #112389
## About
PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above requires the dynamic linking of libcupti.so to have taken place.
- I understand now that static initialization of compilation units runs before that dynamic linking, leading to the segfault in #112389.
## Workaround
We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. This may not be the best approach, but it helps resolve the issue.
## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable
### Without any delay
```
>$ python3 linear_model_example.py
INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```
### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py
cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```
### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py
INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```
### Unit test updated as well.
## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the 5th runtime component we would like to upstream is `Guard`. We will cover device guard and stream guard in this PR.
# Design
Device guards are used mainly by the op dispatcher in PyTorch. Currently, PyTorch already has a device guard abstraction, `c10::impl::DeviceGuardImplInterface`. In our design, we will introduce an `XPUGuardImpl` class that inherits from `c10::impl::DeviceGuardImplInterface` and register `XPUGuardImpl` with PyTorch after we implement the device switch management mechanism in `XPUGuardImpl`. Besides, we will introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`. They all follow the design of their CUDA counterparts. The corresponding C++ files are placed in the c10/xpu/ folder.
# Additional Context
It is unnecessary to add `Guard` code to the PyTorch frontend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118523
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #120315
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.
Test Plan:
unit test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture
Differential Revision: D53674768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
Fixes#115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
In many places in the code we use `tree_map_only((SymInt, SymBool, SymFloat), foo)` but with nested ints, it is possible to have SymInts that are non-symbolic, so we may want to do something like `tree_map_only(is_symbolic, foo)` instead.
Alternative: wrap nested int SymNodes with something other than SymInt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119974
Approved by: https://github.com/zou3519
ghstack dependencies: #119661
This PR simplifies the output wrapping handling in op dispatch to make
it simpler and easier to understand.
It also enables a new case: if the output DTensorSpec for the result is
None and the result is a scalar tensor, we just return the scalar
tensor instead of wrapping it in a DTensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120297
Approved by: https://github.com/wz337
Fixes #117794
Fix tripped the assert here: 86dedebeaf/torch/utils/_python_dispatch.py (L216)
From investigation: I found that functionalization of an in-place op (`mul_` in this test case) results in the strides of `TwoTensor`'s `a` / `b` components being mutated to be contiguous. This is not reflected in the outer tensor, causing the assert to be tripped.
After discussion with Brian, I address this in this PR by disallowing input mutations on non-contiguous tensor subclass inputs for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117860
Approved by: https://github.com/bdhirsh
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.
Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
Move serial tests to run first
If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside other parallel tests as much as possible instead of interleaving serial and parallel tests, which would decrease the effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.
See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it in a picklable form. It also adds functions that
should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.
Testing the dump function is a bit tricky, as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert on the pickled result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
This PR removes and adds some failures and successes that were hidden in the past week (ish).
https://github.com/pytorch/pytorch/pull/119408 (47182a8f4b5e36e280ca3595ba134f53499d2dc9) accidentally removed environment variables on rerun (see PR body of https://github.com/pytorch/pytorch/pull/120251 for slightly more details).
Enabling testing with dynamo is set using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not xml test reports get made, and that also got removed on rerun, so the xmls weren't made either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```
Test Plan:
Added new test
```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations
Differential Revision: D53975130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.
Other tests are marked with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
env=None (which is the default) inherits the env from the calling process. Explicitly set the env to the calling process env so that things can be added to it later
Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
Implicitly, the return type of `set_extra_state` is `NoReturn` since it always raises an error, and pyright will complain about mismatched return types if you override it with an implementation that doesn't also always raise an error. If we explicitly hint the return type as `None` (how we expect people to override it), we can avoid this error message.
```
Method "set_extra_state" overrides class "Module" in an incompatible manner
Return type mismatch: base method returns type "NoReturn", override returns type "None"
```
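A sketch of the override pattern that now type-checks cleanly (the module and its state contents are made up for illustration):
```python
import torch.nn as nn

class MyModule(nn.Module):
    def get_extra_state(self) -> dict:
        return {"version": 1}

    def set_extra_state(self, state: dict) -> None:
        # Restore whatever get_extra_state saved; annotating `-> None` now
        # matches the base class hint instead of clashing with NoReturn.
        self.format_version = state.get("version", 0)
```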
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120161
Approved by: https://github.com/mikaylagawarecki
This PR fixes the issue in https://github.com/pytorch/pytorch/issues/120188.
In collect_metadata_analysis.py, handling of input/output mutations was different from handling in other locations. In other locations, MUTATED_OUT_GRAPH was used to indicate that mutation would require returning an output; in collect_metadata_analysis.py, any type of mutation was being handled as if it would require returning an output.
This PR changes collect_metadata_analysis to match other callsites and refactors computation of mutation types so that it is a property of the dataclass instead of something that needs to be computed manually when constructing an InputAliasInfo.
Differential Revision: [D53950998](https://our.internmc.facebook.com/intern/diff/D53950998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #120141
This PR removes the conditional logic depending on requires_subclass_dispatch for mutation handling.
Inputs are labeled with one of three labels: NOT_MUTATED, MUTATED_IN_GRAPH, or MUTATED_OUT_GRAPH. MUTATED_IN_GRAPH indicates mutation that is allowed in the aot autograd graph; MUTATED_OUT_GRAPH indicates mutation that is not allowed in the graph, so the result is computed, returned, and then assigned back to the input after the graph.
Previously, there was logic to handle subclasses differently, so that MUTATED_IN_GRAPH + subclasses would behave like MUTATED_OUT_GRAPH.
This PR simplifies aot_autograd's handling of mutations so that MUTATED_IN_GRAPH will always be handled in graph, even when subclasses are present. Note that there are still some cases where subclass support won't be handled correctly.
Differential Revision: [D53950999](https://our.internmc.facebook.com/intern/diff/D53950999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120141
Approved by: https://github.com/bdhirsh
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.
~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()`, so the overhead should be identical.
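A sketch of the equivalence (assumes a CUDA device is available):
```python
import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# Sugar form:
s1.wait_stream(s2)

# Roughly what it expands to:
event = torch.cuda.Event()
event.record(s2)      # with no stream argument, the current stream is used
s1.wait_event(event)
```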
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985
Additional changes: tests in test_functional_api.py use the multi-threaded PG, which is implemented in Python. For the native ops to call into the Python PG implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it.
Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
Pre-emptive test in OSS to ensure that models relying on the "non-overlapping guards" checks do not suffer drastically w.r.t. guard slowness. Current plan is to follow up on this with a "real" fix, to generate a linear number of these guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120106
Approved by: https://github.com/mlazos
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.
Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/soulitzer
For ROCm/HIP, each stream is lazily initialized rather than creating all streams when the first stream is requested. HIP streams are not as lightweight as CUDA streams; the pooling strategy can affect performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119996
Approved by: https://github.com/ezyang
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it).
I finally root-caused this as follows:
(1) we were fakeifying a DTensor graph input that was an autograd non-leaf (it had a grad_fn)
(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549
(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.
(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.
I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNodes` cases.
We always store the `lowp buffer` to the `store_cache`, then load the `lowp buffer` from the `store_cache` and convert it to float before the compute ops.
Now we generate a `{key: to(float32)_expr, value: the float32 cse var before to_dtype and store}` entry in `cse.cache`.
Then the `to_dtype(float32)` after the `load` hits this cache and does not generate a new var with cast code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
It's already a requirement for building PyTorch, but it should also be a
requirement for linking extensions with it, as a mismatch can lead to runtime
crashes: the `std::optional` template layout is incompatible between
gcc-9 and older compilers.
Also, update the minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.
Fixes https://github.com/pytorch/pytorch/issues/120020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:
- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.
run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).
run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.
run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```
This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
Fixes #119768
- #119768
This PR adds a new function `tree_iter` that lazily iterates over the tree leaves. It differs from the `tree_leaves` function in that the latter traverses the whole tree first to build a list of leaves.
```python
for leaf in tree_iter(tree):
...
```
is much more efficient than:
```python
for leaf in tree_leaves(tree):
...
```
where `tree_leaves(tree)` is `list(tree_iter(tree))`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120155
Approved by: https://github.com/vmoens
Previously, pad_mm skipped cases where any input tensor has a symbolic
dimension or stride. This is too constrained in practice.
This PR enables this pass to pad non-symbolic dimensions in
the presence of dynamic dims. For example, with this PR, we could
pad the K dimension (i.e. 1921) for torch.mm(A[s0, 1921], B[2048, 1921]).
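A standalone sketch (not the pad_mm pass itself) of why zero-padding the K dimension preserves the matmul result; concrete shapes stand in for the dynamic `s0`, and B is written as [1921, 2048] so the mm is well formed:
```python
import torch
import torch.nn.functional as F

A = torch.randn(32, 1921)     # stands in for A[s0, 1921]
B = torch.randn(1921, 2048)

A_p = F.pad(A, (0, 7))        # pad K: 1921 -> 1928
B_p = F.pad(B, (0, 0, 0, 7))  # pad B's K (first) dim to match

# Equal up to floating-point rounding, since the padded entries are zero.
torch.testing.assert_close(A_p @ B_p, A @ B, rtol=1e-4, atol=1e-4)
```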
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120073
Approved by: https://github.com/jansel
The test is failing on our internal CI with the error below:
```RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
The purpose of this test is NCCL-specific, so it doesn't make sense to run it in a 1-GPU setting either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120130
Approved by: https://github.com/wconstab, https://github.com/eqy
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.
In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.
Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
Test Plan:
```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
This PR makes the tests for inline and sequential_split stop relying on set_grad_enabled being in the graph, because those calls will be gone once we turn on the replace_set_grad_with_hop pass in the following diff. Instead, we'll manually insert them into the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119914
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913
As titled. Before this PR, after we split and then inline_, there were getitem calls in the graph that the original graph module doesn't have. This PR removes the additional getitem calls by inlining.
Test Plan:
Added new test cases for graphs that return multiple outputs and takes multiple inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119913
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810
This PR is the first in a series transforming global-state-mutating ops, such as torch._C._set_grad_enabled calls, in the pre-dispatch graph into a higher-order op so that the graph becomes more functional. We make use of split_module to help us do the transformation.
This PR preserves the node.name from the original module by adding a new kwarg `keep_original_node_name` to split_module.
For a graph looks like this:
```python
def forward(self, arg_0):
arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
add = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin = torch.ops.aten.sin.default(add); add = None
sum_1 = torch.ops.aten.sum.default(sin); sin = None
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_1 = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
_set_grad_enabled_1 = torch._C._set_grad_enabled(True)
sub = torch.ops.aten.sub.Tensor(add_1, 1)
return pytree.tree_unflatten((add_1, sub), self._out_spec)
```
Before the change, split graph returns the following graphs and subgraphs (notice the change from `add` -> `add_tensor`, `sin` -> `sin_default`):
```python
def forward(self, arg_0):
arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
submod_0 = self.submod_0(arg0_1); arg0_1 = None
submod_1 = self.submod_1(submod_0); submod_0 = None
submod_2 = self.submod_2(submod_1)
return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)
# submod_0
def forward(self, arg0_1):
add_tensor = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin_default = torch.ops.aten.sin.default(add_tensor); add_tensor = None
sum_default = torch.ops.aten.sum.default(sin_default); sin_default = None
return sum_default
# submod_1
def forward(self, sum_1):
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_tensor = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
return add_tensor
# submod_2
def forward(self, add_1):
_set_grad_enabled = torch._C._set_grad_enabled(True)
sub_tensor = torch.ops.aten.sub.Tensor(add_1, 1); add_1 = None
return sub_tensor
""")
```
After the change, the test produce the following graph, all the node names in original graph module are preserved in sub_modules.
```python
def forward(self, arg_0):
sub, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
submod_0 = self.submod_0(sub); sub = None
submod_1 = self.submod_1(submod_0); submod_0 = None
submod_2 = self.submod_2(submod_1)
return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)
# submod_0
def forward(self, arg0_1):
add = torch.ops.aten.add.Tensor(arg0_1, 1); arg0_1 = None
sin = torch.ops.aten.sin.default(add); add = None
sum_1 = torch.ops.aten.sum.default(sin); sin = None
return sum_1
# submod_1
def forward(self, sum_1):
_set_grad_enabled = torch._C._set_grad_enabled(False)
add_1 = torch.ops.aten.add.Tensor(sum_1, 1); sum_1 = None
return add_1
# submod_2
def forward(self, add_1):
_set_grad_enabled_1 = torch._C._set_grad_enabled(True)
sub = torch.ops.aten.sub.Tensor(add_1, 1); add_1 = None
return sub
```
Note that currently, we call split_module on the graph after pre-dispatch AOT. The difference is even larger if we `split_module` the graph module produced by dynamo, where all the original variable names in the user program are preserved after dynamo but would be lost after `split_module` without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119732
Approved by: https://github.com/tugsbayasgalan
1. Right now we double-increment the profile counter; the PR avoids that so we don't end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip those _compile calls which return quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
I want to log metadata for Inductor-generated Triton kernels for a couple of purposes:
1. With this metadata, it should be convenient to find unaligned reduction kernels and try the idea here: https://github.com/pytorch/pytorch/issues/119929 . I think it's nice to try it on kernels that are used in real models.
2. Based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with: latency, theoretical membw (estimated memory access / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.
Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```
The best practice here is to point inductor cache to a folder outside of /tmp so that one can always run the kernel again based on the path stored in kernel metadata. (folders under /tmp may get removed by the system)
Here is first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f
And here are all 10K kernels collected for huggingface. The gist cannot be rendered as a CSV since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
Summary: Added Quantized gelu for vulkan backend.
Test Plan:
**Tested it on "On Demand RL FBSource"**
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_quantized_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="VulkanAPITest.gelu_q*"
----------------------------------------------------------------------------------
Note: Google Test filter = VulkanAPITest.gelu_q*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.gelu_qint8
[ OK ] VulkanAPITest.gelu_qint8 (318 ms)
[ RUN ] VulkanAPITest.gelu_qint8_self
[ OK ] VulkanAPITest.gelu_qint8_self (214 ms)
[ RUN ] VulkanAPITest.gelu_quint8
[ OK ] VulkanAPITest.gelu_quint8 (152 ms)
[ RUN ] VulkanAPITest.gelu_quint8_self
[ OK ] VulkanAPITest.gelu_quint8_self (142 ms)
[----------] 4 tests from VulkanAPITest (828 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (828 ms total)
[ PASSED ] 4 tests.
Differential Revision: D52985437
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119935
Approved by: https://github.com/jorgep31415
Fixes https://github.com/pytorch/pytorch/issues/117596
This was needed for Float8Tensor. Before this PR, dynamo would sometimes handle attribute access on tensor subclasses correctly, but it would choke on tensor subclasses with no source (it would fall back to using a `GetAttrVariable` to represent the attribute access, which is a problem if the attribute is a tensor that we later want to call tensor methods on).
I supported two cases:
(1) the attribute is a tensor, which is part of the `attrs` returned by the subclass's `__tensor_flatten__`. This creates a `TensorVariable`
(2) the attribute is a constant, which is part of the constant metadata returned by `__tensor_flatten__`. As per the contract of tensor_flatten, this should be a `ConstantVariable`. It could be possible that we allow non-constant metadata in the future, but we don't support that today.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117666
Approved by: https://github.com/zou3519
ghstack dependencies: #117667
### Summary
@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.
- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with the Python funcol broadcast and `AsyncCollectiveTensor` (see the usage sketch after this list)
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
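For illustration, a minimal usage sketch of the Python-side entry point (hedged: assumes the program is launched with torchrun so a default process group can be initialized, with one CUDA device per rank):
```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# Hedged sketch: launch with torchrun so a default process group exists.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.full((4,), float(rank), device="cuda")
# Functional broadcast returns a new tensor (an AsyncCollectiveTensor until used)
# instead of mutating `t` in place.
out = funcol.broadcast(t, src=0, group=dist.group.WORLD)
print(out)  # every rank prints rank 0's values
```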
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
`nargs="?"` accept 0 or 1 argument, but `nargs="*"` accepts 0 or any number of arguments, which is the intended behavior of the tool
Test plan: Run `python tools/build_with_debinfo.py aten/src/ATen/native/cpu/BlasKernel.cpp aten/src/ATen/native/BlasKernel.cpp` and observe that it generates torch_cpu with those two files containing debug information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120088
Approved by: https://github.com/Skylion007
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
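A hedged usage sketch of the new util; the import paths below are assumptions and may differ from where the helper actually lands:
```python
import torch
from torch.ao.quantization import move_exported_model_to_eval
# Assumed location of the new util; it may be re-exported elsewhere.
from torch.ao.quantization.pt2e.export_utils import model_is_exported


def set_eval(m: torch.nn.Module) -> torch.nn.Module:
    # Exported (PT2E) models don't support .eval(); use the dedicated helper instead.
    if model_is_exported(m):
        return move_exported_model_to_eval(m)
    return m.eval()
```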
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported
Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. In doing this, we also remove
a spurious guard on this op.
We move the ref into a decomp, as we now use advanced indexing.
The only difference in the implementation is that we now use
advanced indexing rather than `torch.cat`.
We also remove it from core. Let's see how this goes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
Summary: If CUDA is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty, which means that the registration hooks are not set up. This means that a new segment_alloc will not be registered, causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that CUDA is initialized before attaching the hooks. If CUDA is already initialized, then this lazyInitCUDA is a no-op.
Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.
Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000
Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION
Reviewed By: minsii, kwen2501, xw285cornell
Differential Revision: D53770989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
Fix: https://github.com/pytorch/pytorch/issues/119779 by properly graph breaking; a proper fix for a complete solution would be to handle quantized tensors.
If, when generating a fake tensor, UnsupportedFakeTensorException is thrown, then it's handled and converted into an
Unimplemented inside wrap_fake_exception, which is then translated to a graph break.
However, run_node used to convert UnsupportedFakeTensorException into a runtime error, creating runtime
errors instead of graph breaks whenever generating a fake tensor for a quantized tensor failed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120026
Approved by: https://github.com/jansel
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. And following our design prepare to generalize `Allocator` in parallel.
# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to generalize the device `Allocator` to be device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.
The differences with CUDA:
only the key functionality is covered; it lacks AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments...
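For orientation, a minimal sketch of what the allocator ends up serving (hedged: requires an XPU-enabled build and device):
```python
import torch

# Allocations on the 'xpu' device are served by the caching XPUAllocator described
# above; freed blocks are cached and reused rather than returned to the driver.
x = torch.empty(1024, 1024, device="xpu")
y = torch.empty_like(x)
del x, y  # the blocks go back to the allocator's cache, not to the driver
```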
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls
Test Plan:
```
buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.
Reviewed By: SherlockNoMad
Differential Revision: D53718139
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`.
# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, an `XPUEvent` can wait for another `XPUEvent` or for all the submitted kernels on an `XPUStream` to complete. Aligned with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`, respectively.
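A minimal sketch of the intended Python surface (hedged: requires an XPU-enabled build; the method set is assumed to mirror `torch.cuda.Event`, minus `elapsed_time` as noted below):
```python
import torch

stream = torch.xpu.current_stream()
event = torch.xpu.Event()
event.record(stream)   # the underlying SYCL event is created lazily on first record
event.synchronize()    # block the host until the recorded work completes
print(event.query())   # True once the work captured by the event is done
```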
# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. Meanwhile, `XPUEvent` doesn't support IPC across different processes. For the other parts, we have an almost 1:1 mapping with CUDA.
It lacks the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
Summary: An fbcode test exposed a shortcoming where we serve a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries from the same mode we're currently in.
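A small illustration of why the mode has to be part of the key (not the cache implementation itself): the same op produces tensors with different properties depending on inference_mode, so an entry cached in one mode must not be served in the other.
```python
import torch

x = torch.randn(3, requires_grad=True)
with torch.inference_mode():
    y = x + 1   # inference tensor, no autograd metadata
z = x + 1       # regular tensor, participates in autograd
print(y.is_inference(), y.requires_grad)  # True False
print(z.is_inference(), z.requires_grad)  # False True
```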
Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
Goal: for all eager unit tests, we want to test torch.compile by default.
This PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. For now it compiles compute only, as follows:
```
module.compile() # include user registered hooks if any
fully_shard(module)
```
torch.compile does not work with the following components yet:
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
Sets up the NVIDIA runtime and runs the indexer inside a Docker container.
Verified this works by running the indexer jobs (all the setup is correct, it OOMs for an unrelated reason, for which a fix is on the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119923
Approved by: https://github.com/huydhn
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. We now just launch multiple kernels with a simpler version of the struct (while still minimizing the number of kernels launched).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
print_performance previously returned the total execution time for `times` runs, but now it returns the average execution time of a single run. Change the profiler to be consistent with that. Not sure if there is a good way to add a test, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119959
Approved by: https://github.com/eellison
Fixes #119722.
1. Added the missing device argument in the following calls (a corrected sketch is shown after this list):
```
max_memory_allocated = torch.cuda.max_memory_allocated()
max_memory_reserved = torch.cuda.max_memory_reserved()
```
2. Fixed the device parameter to device_str. Based on [these lines](2bda6b4cb8/torch/profiler/profiler.py (L291)), the input device is a string (device_str) for
```
self.mem_tl.export_memory_timeline_html
self.mem_tl.export_memory_timeline_raw
self.mem_tl.export_memory_timeline
```
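For item 1, a hedged sketch of the corrected calls with the device passed explicitly:
```python
import torch

# Query peak memory stats for the profiled device explicitly instead of relying
# on the current default device (the device index here is just an example).
device = torch.device("cuda:0")
max_memory_allocated = torch.cuda.max_memory_allocated(device)
max_memory_reserved = torch.cuda.max_memory_reserved(device)
```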
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119751
Approved by: https://github.com/aaronenyeshi
Seeing the error below for c10d tests when running on 1 GPU. Adding a skip when there are insufficient GPUs.
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
referring to https://github.com/pytorch/pytorch/pull/84980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119938
Approved by: https://github.com/eqy, https://github.com/fegin
Summary: `ExportedProgram` is an artifact produced by torch.export, containing the graph that is exported, along with other attributes about the original program such as the graph signature, state dict, and constants. One slightly confusing thing that users run into is that they treat the `ExportedProgram` as a `torch.nn.Module`, since the object is callable. However, as we do not plan to support all features that `torch.nn.Module`s have, like hooks, we want to create a distinction between it and the `ExportedProgram` by removing the `__call__` method. Instead users can create a proper `torch.nn.Module` through `exported_program.module()` and use that as a callable.
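A small sketch of the resulting workflow on a toy module (using `torch.export.export` for illustration):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
# The ExportedProgram is no longer directly callable; materialize a real
# nn.Module via .module() and call that instead.
mod = ep.module()
print(mod(torch.randn(2)))
```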
Test Plan: CI
Differential Revision: D53075378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119466
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.
https://github.com/pytorch/pytorch/issues/119587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
nvcc flag `--generate-dependencies-with-compile` doesn't seem to be supported by `sccache` for now. Builds with this flag enabled will not benefit from sccache.
This PR adds an environment variable that allows users to set this flag and skip those nvcc dependencies to speed up their build with compiler caches. If everything is a "fresh build" in CI, we don't care if there are unnecessary recompiles during incremental builds.
related: https://github.com/pytorch/pytorch/pull/49344
- [ ] todo: raise an issue to sccache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119936
Approved by: https://github.com/ezyang
Fixes #118862
If libtorch is included multiple times in different sub-folders, linking caffe2::mkl may incur errors like
```
Cannot specify link libraries for target "caffe2::mkl" which is not built
by this project.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945
Approved by: https://github.com/ezyang
Timestamps the generated embedding indices. Moves the old indices to an `archived/` folder and then uploads the index to a `latest/` folder. There will be a short period in between these operations where there is no index in `latest/`. To handle this case, any workflow fetching the index (such as the retriever) should use a retry with backoff when copying from S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119955
Approved by: https://github.com/huydhn
Summary: In some cases the outputs of `softmax` are so small that they are below float16 precision. These values are represented as 0 in float16 and result in `-inf` when log is applied. According to [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Exponent_encoding), the minimum strictly positive (subnormal) value is 2^−24 ≈ 5.9605 × 10^−8. Therefore, we add 6 × 10^-8 to the output of softmax to avoid the numerical issue.
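A small plain-PyTorch illustration of the underflow (not the Vulkan shader itself):
```python
import torch

# exp(-30) / sum is far below the smallest positive float16 value (~5.96e-8),
# so it rounds to 0 once stored as float16 and log() then yields -inf.
probs = torch.softmax(torch.tensor([0.0, 30.0]), dim=0).to(torch.float16)
print(torch.log(probs.float()))         # tensor([-inf, 0.])
print(torch.log(probs.float() + 6e-8))  # finite everywhere after adding the epsilon
```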
Test Plan:
We add two tests:
- `log_softmax_underflow_exception` tests the log_softmax without adding epsilon to the output of softmax, so we expect to get nan or -inf. (**NOTE**: this test has passed on both devserver and on Android device, but failed on the `
fbsource//xplat/caffe2:vulkan_ops_testAndroid` test on CI. In this test, `log` of small numbers [even `log 0` shows output -88 instead of `-inf`](https://interncache-cco.fbcdn.net/v/t49.3276-7/379414752_342395058779076_6447867753374424757_n.txt?ccb=1-7&_nc_sid=ce8ad4&efg=eyJ1cmxnZW4iOiJwaHBfdXJsZ2VuX2NsaWVudC9pbnRlcm4vc2l0ZS94L3Rlc3RpbmZyYSJ9&_nc_ht=interncache-cco&oh=00_AfApTdId1WOHUqdoSTc66s6adnrQt1YS0NDT-LDppIvX0g&oe=65D0CC99). We cannot reproduce this error on device now, so we **DISABLE** this test for now to integrate into CI.)
- `log_softmax_underflow` tests the updated implementation of log_softmax, nan and -inf have been removed
## test on devserver
```
luwei@devbig984.prn1 /data/users/luwei/fbsource (9f6b78894)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log_softmax_underflow*"
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/baaaa683-60da-4dd8-95b9-6848fe1d7d74
Network: Up: 53KiB Down: 1.4MiB (reSessionID-9580ce4f-7e1e-4c65-8497-52443329b796)
Jobs completed: 6. Time elapsed: 24.2s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN ] VulkanAPITest.log_softmax_underflow
[ OK ] VulkanAPITest.log_softmax_underflow (169 ms)
[----------] 1 test from VulkanAPITest (169 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (169 ms total)
[ PASSED ] 1 test.
YOU HAVE 1 DISABLED TEST
```
full test results: P1184164670
```
[----------] 428 tests from VulkanAPITest (21974 ms total)
[----------] Global test environment tear-down
[==========] 428 tests from 1 test suite ran. (21974 ms total)
[ PASSED ] 427 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 11 DISABLED TESTS
```
## test on device:
- build
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid --show-output
```
- push to device and run
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid --gtest_filter="*log_softmax_underflow*"
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN ] VulkanAPITest.log_softmax_underflow
[ OK ] VulkanAPITest.log_softmax_underflow (292 ms)
[----------] 1 test from VulkanAPITest (293 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (294 ms total)
[ PASSED ] 1 test.
YOU HAVE 1 DISABLED TEST
```
Reviewed By: yipjustin
Differential Revision: D53694989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119898
Approved by: https://github.com/jorgep31415
That would bundle PTXAS into a `bin` folder.
When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA 11.8 usable with an 11.8 driver, as Triton is bundled with the latest ptxas (CUDA 12.3 at the time of the PyTorch 2.2 release).
Needs 5c814e2527 to produce valid binary builds.
Test plan:
- Create a dummy ptxas in the `torch/bin` folder and observe `torch.compile` fail with a backtrace in the Triton module.
- Run the following script (to be added to binary tests) against a CUDA 11.8 wheel:
```python
import torch
import triton
@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x) + torch.cos(x)
x = torch.rand(3, 3, device="cuda")
print(foo(x))
# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch built with {cuda_version}, but Triton uses ptxas {ptxas_version}"
```
Fixes https://github.com/pytorch/pytorch/issues/119054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750
Approved by: https://github.com/jansel, https://github.com/atalman
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.
Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
Move serial tests to run first
If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside parallel tests as much as possible instead of interleaving serial + parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.
See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
This PR refactors the sharding cost model to do a more accurate
estimation of redistribute cost, including both collective latency and
communication time.
The previous cost model did not rescale the latency and communication
time, so the latency factor was too small to be counted, and in
the case of small tensors, multiple collectives were preferred over a
single collective, which is wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l
This PR adds a way to do gradient accumulation without collectives (i.e. reduce-scatter for FSDP and reduce-scatter/all-reduce for HSDP, though HSDP is not yet implemented). Since the `no_sync()` context manager has received some feedback, we simply define a method on the module to set whether the module requires gradient synchronization or not, where this method can recurse or not.
```
# Before with `no_sync()`:
with fsdp_model.no_sync() if not is_last_microbatch else contextlib.nullcontext():
# Forward/backward
# After with a setter:
fsdp_model.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
Having the method be able to recurse or not also gives some flexibility. For example, some large modules can still reduce-scatter, while some smaller modules can avoid it to save communication bandwidth:
```
fsdp_modules_to_reduce_scatter: Set[nn.Module] = ...
for module in fsdp_model.modules():
if isinstance(module, FSDP) and module not in fsdp_modules_to_reduce_scatter:
module.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
(Separately, we may expose a helper for `return [module for model.modules() if isinstance(module, FSDP)]`.)
---
To show the spirit of this API choice, I also included `set_requires_all_reduce` that would give us the ability to only reduce-scatter but not all-reduce for HSDP (originally from the MiCS paper). If we want to flexibly support heterogeneous sharding where FSDP is applied to some modules and HSDP to others in the same model, then having a module-level method that has the option to not recurse makes sense to me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118298
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223, #118755, #119825
Summary:
as title.
The following APIs are logged:
- capture_preautograd_graph
- torch._export.aot_compile
- external usage of _export_to_torch_ir (AOTInductor, Pippy)
- constraints API
- public use of torch._dynamo.export
Test Plan: CI
Differential Revision: D53735599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119848
Approved by: https://github.com/suo
So far it has only been testing the legacy conversion, rather than the one actually used when `at::Half` is constructed.
Test the `fp16` to `fp32` conversion for the whole range of its 65536 values, though skip NaN comparisons, as different algorithms are not guaranteed to yield identical NaN representations and they are different anyway.
Do a small code cleanup: remove extraneous semicolons as well as a named namespace inside an unnamed one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119892
Approved by: https://github.com/kit1980
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
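As a quick illustration of replaying a view through the reified func (a sketch using the private `_view_func` binding; it assumes the view was created with autograd enabled so the view func is recorded, and omits the optional visitor callables):
```python
import torch

base = torch.randn(4, 4, requires_grad=True)
view = base.narrow(0, 1, 2)    # saved state: dim=0, start=1, length=2
new_base = torch.zeros(4, 4, requires_grad=True)
# Reapply the same view op on a different base via the codegenerated ViewFunc.
replayed = view._view_func(new_base)
print(replayed.shape)          # torch.Size([2, 4])
```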
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/119436
<s>In essence we need to ensure aliases are run in separate foreach kernels so that they are ordered correctly. Previously, aliases could end up in the same kernel which creates weird scheduling dependencies.</s>
There was a bug in cycle detection/can_fuse which was creating cycles when more than two aliases were used in foreach nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119508
Approved by: https://github.com/jansel
When building guards that went through a property, we were analyzing the property using getattr_static, but the guard wasn't built using getattr_static, so if the property was "unusual" it generated misbehaving code that referenced a non-existent `__closure__` field.
Fixes #118786
Note that after this change some of the referenced tests are still failing, but with a different error - they are getting further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119719
Approved by: https://github.com/oulgen
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:
* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.
See the README.md for additional details.
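For orientation, a hedged sketch of how TunableOp is typically enabled; the environment variable names below are taken from the README and should be treated as assumptions here:
```python
import os

# Assumed knobs from the TunableOp README: turn tuning on and persist results.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
c = a @ b  # the first call for a given GEMM shape triggers tuning across the available algorithms
```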
TunableOp was ported from onnxruntime starting from commit 08dce54266. The content was significantly modified and reorganized for use within PyTorch. The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:
- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
Before:
```
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['x'], 70049616) # assert x.shape[0] > 2 # b.py:5 in f
[2024-02-13 19:34:50,592] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False # assert x.shape[0] > 2 # b.py:5 in f
```
After this change, the logs look like this:
```
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1023 [0/0] GUARDS:
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] ___check_type_id(L['x'], 70050096) # assert x.shape[0] > 2 # b.py:5 in f
V0214 07:00:49.355000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] hasattr(L['x'], '_dynamo_dynamic_indices') == False # assert x.shape[0] > 2 # b.py:5 in f
```
The main differences from what we had before:
* We don't print DEBUG/INFO/WARNING, instead, we only print a single character. DEBUG, somewhat oddly, maps to V, because it corresponds to glog VERBOSE
* The year is omitted, and a more compact representation for date/month is adopted. Somewhat perplexingly, six digits are allocated for the nanoseconds, even though Python typically doesn't have that level of resolution
* The thread ID is included (in a containerized environment, this thread id will be typically much lower)
* Instead of using the module name, we give a filepath, as well as the line the log message was emitted from. I think the line number is a nice touch and improvement over our old logs, but one downside is we do lose the artifact name in the log message, in case anyone was grepping for that.
* I chose to move the compile id prefix to the very end so as to keep a uniform layout before it, but I do think there are benefits to having it before the filename
Meta only: This format was reverse engineered off of 6b8bbe3b53/supervisor/logging.py and https://www.internalfb.com/code/fbsource/[e6728305a48540110f2bdba198aa74eee47290f9]/fbcode/tupperware/front_end/log_reader/filter/StreamingLogLineFilter.cpp?lines=105-114
Now, I think this may be slightly controversial, but I have chosen to apply this format *by default* in OSS. My reasoning is that many PT2 developers work with the logs in OSS, and keeping the format identical to what we run in prod will make it easier for these skills to transfer.
The non-negotiable portion of the new format is "V0213 19:28:32"; the date string is expected to be in exactly this form or Tupperware will fail to parse it as a date.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119869
Approved by: https://github.com/oulgen, https://github.com/mlazos, https://github.com/Skylion007
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
Summary: When we deserialize nn_module_stack, sometimes the module no longer exists in the python environment so we cannot deserialize it back into the python type and instead it's kept as a string. This causes downstream failures when retracing due to one of our checks in export. This diff just bypasses the check.
Test Plan: CI
Reviewed By: chakriu
Differential Revision: D53527706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119753
Approved by: https://github.com/zhxchen17
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temporary object and will be destructed when we want to dump/access it.
A quick fix is to store a copy of the string, but without changing the
upstream char*.
An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.
We chose the first option for this change.
Resolves #119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
When, during the `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:
31e59766e7/torch/_inductor/ir.py (L3805-L3816)
This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:
31e59766e7/torch/_inductor/ir.py (L3479)
Here we add a special case handling for this to unwrap `x` recursively.
Test Plan:
This local repro:
```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)

f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```
with this line:
690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)
changed to `if cond(*args, **kwargs):` fails before and succeeds after this PR.
Differential Revision: D53743146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator), we are experimenting with using CodeLlama-powered information retrieval for target determination.
The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.
This PR creates a workflow that does the indexing part (creating embeddings for functions and store in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.
Test Plan:
cat /tmp/script.py
~~~
import sys

def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~
Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0
~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜ workspace echo $?
0
~~~
Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜ workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
run(args)
File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-12_09:16:25
host : kurman-mbp.dhcp.thefacebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 64668)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~
Differential Revision: D53653874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
# Summary
Initially reported in https://github.com/pytorch/pytorch/issues/119320
I found that by updating this function the NaN values went away. I then created a godbolt link to try and highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M
However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working on it.
### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f. Hence the reason why this update fixes the reported issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
Summary:
This PR is a refactor of semi-structured sparsity support.
**deprecation**:
Before this change, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused and
now throws a deprecation warning.
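A minimal conversion sketch (hedged: needs a CUDA GPU with sparse tensor core support, half precision, and a weight already pruned to a 2:4 pattern; shape constraints apply):
```python
import torch
from torch.sparse import to_sparse_semi_structured

# A (128, 128) half-precision matrix whose rows follow a 2:4 sparsity pattern.
dense = torch.tensor([0, 0, 1, 1], dtype=torch.float16, device="cuda").tile((128, 32))
sparse = to_sparse_semi_structured(dense)  # note: no `transposed=` kwarg anymore
x = torch.randn(128, 128, dtype=torch.float16, device="cuda")
out = sparse @ x
```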
Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.
I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines, into
`_sparse_semi_structured_ops.py`
With this subclass, all of our internal tests pass, as well as those in
xFormers.
The main change is that we now define a base subclass,
`SparseSemiStructuredTensor` that is inherited from for each of the
specific backends.
We also now can arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, idea being this is still general enough
where users don't need to modify pytorch source code to get their model
working.
This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.
There still remains two components in xFormers that will need to be
ported over eventually:
- the autograd functions (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MTIAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~CUDAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MPSHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
```
Likely introduced by https://github.com/pytorch/pytorch/pull/119329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
Improve performance of inductor searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.
Fixes #98467
Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
Fusion is still slow - but at least finishes.
After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s
Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
By just calling `std_mps` and `mean` in sequence
Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype=torch.float32,
    device: str = "cpu",
) -> Measurement:
    setup = f"""
x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
"""
    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between the native and interpreted runtimes is shorter).
Fixes https://github.com/pytorch/pytorch/issues/119663
TODOs:
- Refactor the codebase and implement proper composite function (that must be faster)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
Tune the grid and block sizes for ROCm. Add a contig kernel separate from aligned+contig.
Verified new performance using pytorch/benchmarks/operator_benchmark.
`python -m pt.cat_test --device=cuda --tag-filter all`
On MI200 this improved performance on average 4%, and on MI300 14%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above line of code has the unintended consequence of invoking copy/assignment
of entry objects, as the ref itself cannot be re-assigned.
Also, what could cause the crash is that the entry ref could become invalid if entries_ are
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
This PR adds mixed precision configured via `MixedPrecisionPolicy`.
- By default (`cast_forward_inputs=True`), each FSDP module will cast forward floating-point input tensors to `param_dtype` if specified. If the user wants to own the cast, then the user can disable it by passing `False`.
- Symmetrically, by default (`output_dtype=None`) each FSDP module will not cast the forward output. If the user wants to customize the output dtype, then the user can pass a `torch.dtype`.
- `param_dtype` configures the unsharded parameters' dtype for forward/backward computation and hence the all-gather dtype.
- `reduce_dtype` configures the gradient reduction dtype. If `reduce_dtype=None` and `param_dtype is not None`, then `reduce_dtype` inherits from `param_dtype` for simplicity.
We test against a manually implemented reference implementation instead of comparing against existing FSDP since the comparison is more direct to what we want to test.
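A configuration sketch of the policy described above (hedged: the import path of the per-parameter FSDP prototype is an assumption and may move; run under torchrun):
```python
import torch
import torch.distributed as dist
# Assumed location of the per-parameter-sharding FSDP prototype APIs.
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(16, 16, device="cuda")
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,  # unsharded param / all-gather dtype for forward-backward
    reduce_dtype=torch.float32,  # gradient reduction dtype (inherits param_dtype if None)
)
fully_shard(model, mp_policy=mp_policy)
# With the default cast_forward_inputs=True, the fp32 input is cast to bf16 in forward.
out = model(torch.randn(8, 16, device="cuda"))
```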
---
**Overhead benchmarks to inform design**
The dilemma is as follows:
- The common path for FSDP is bf16 parameter mixed precision, where we cast sharded parameters from fp32 to bf16 before all-gathering them.
- The baseline implementation is to `torch._foreach_copy_` the sharded parameters to the flat `all_gather_input`, which gets passed to `dist.all_gather_into_tensor`.
- The baseline incurs 1 extra fp32 read and 1 extra bf16 write per parameter because `_foreach_copy` takes the slow path, calling `copy_` in a loop, and `copy_` calls `dst.copy_(src.to(bf16))` where `dst` is bf16 and `src` is fp32.
- These `copy_` calls stay in C++ and do not require calling `at::as_strided`.
- The issue with this baseline implementation is that it requires knowing that all parameters in the group will be cast from fp32 to bf16 to do this `_foreach_copy_` from fp32 sources to a bf16 destination.
- We want per-parameter FSDP to support mixed dtype all-gathers, which would involve different parameters providing different dtype all-gather inputs and viewing them as uint8 for a combined flat all-gather input, where this viewing-as-uint8 step is only needed in the mixed dtype case.
- However, this incurs more CPU overhead, so we want to investigate this in more detail.
We consider 150 `nn.Parameter`s with shapes taken from an internal model (where the shapes only affect the copy bandwidth, not the CPU overhead). We focus on world size 128 first. We consider two experiments: (1) run the copy-in with no head start, allowing CPU boundedness to affect GPU time, and (2) run the copy-in with a CPU head start, removing CPU overhead from affecting GPU time.
No head start:
- Baseline `torch._foreach_copy_`: 0.525 ms CPU; 0.528 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.828 ms CPU; 0.836 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.933 ms CPU; 0.937 ms GPU
Head start (removing CPU boundedness from GPU times):
- Baseline `torch._foreach_copy_`: 0.393 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.403 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.403 ms GPU
Some other interesting notes:
- Constructing a set of all all-gather input dtypes: ~0.015 ms -- this would be the overhead cost of checking whether we need to view as uint8 (i.e. whether we have mixed dtype); alternatively, we could always view as uint8 (but that loses the mixed precision policy info from the profiler trace)
- Changing from `[t.to(bf16).view(uint8) for t in ts]` to two list comprehensions like `[t.to(bf16) for t in ts]; [t.view(uint8) for t in ts]` actually reduces CPU overhead 🤔 (by ~0.04 ms)
We see that the main difference is just CPU overhead. The GPU times are almost the same. (Actually, sweeping over world sizes of 8, 16, 32, and 64, we do see a difference in GPU time inversely proportional to world size, as expected since smaller world sizes copy more data. However, even at world size 8, the difference is only 0.407 ms vs. 0.445 ms GPU time.) Note though that the CPU overhead differences are exacerbated when the PyTorch profiler is turned on, and how much so seems to depend on the CPU capability.
Seeing these numbers, I am inclined to prefer to just incur the CPU overhead, especially given that if we want to support the mixed dtype case for fp8 all-gather, we will need to incur this anyway. If the CPU overhead becomes a problem on a real workload, then we will need to figure out options then, one possibility being `torch.compile`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118223
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity
This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required for model sharded state dict is to make sure parameters are sharded before save and load.
This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.
(I squashed all of these into one PR now to save CI cost.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.
This PR does the following:
* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
Summary: The codegen of the `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack`, a `contextlib.ExitStack`. As the context managers in the stack are `code.indent()`, the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and/or nested subgraph codegen-ing).
In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).
This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.
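A minimal sketch of the wrapper shape this enables (illustrative, not actual Inductor output), assuming a CUDA device; the point is that explicit indent/unindent lets the `return` sit outside the `_DeviceGuard` block and permits nested guarded regions:
```
import torch

def call(args):
    arg0, = args
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = arg0 + 1  # stands in for a generated kernel launch
    return (buf0,)

if torch.cuda.is_available():
    print(call([torch.ones(4, device="cuda")]))
```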
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
This PR adds support for for-loop parsing and analysis. While doing so, I ran into some constant value and function name problems so I fixed them as well. Technically, it should be possible to break this into multiple PRs but since these are small, I'm bundling them together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119730
Approved by: https://github.com/aakhundov
Summary:
In March 2019 D14468816 introduced some infra to mark tests as flaky
while still running them. In July 2019 D15797371 removed the last use of this
feature. Remove the related code as well.
Test Plan: ci
Reviewed By: mlogachev
Differential Revision: D50601204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112007
Approved by: https://github.com/malfet
There's a bug when converting from TorchVariable to trace-rule lookups: in some corner cases the DTensor.from_local call does not match the trace-name rule lookup, resulting in a None lookup and a fallback to UserFunctionVariable, which makes the tracing silently wrong by tracing into the DTensor.from_local function. Not exactly sure yet why the lookup failed.
This PR fixes the DTensor.from_local tracing to make sure that in every case we hit the InGraphFunctionVariable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119659
Approved by: https://github.com/yifuwang
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197, and we should only wait on `AsyncCollectiveTensor`.
Without the change, we occasionally ran into the exception `AttributeError("'Tensor' object has no attribute 'wait'")`.
Test Plan:
**CI**:
Wait for the CI test
**Test with prod model**:
- Tested with models and no-longer ran into the exception after checkpoint loading.
Differential Revision: D53680406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
Fixes #118990
The root cause is due to `out_features` of Linear not matching `num_features` of BatchNorm, resulting in shape mismatch while computing `fused_w`, and `fused_b`. This can happen for linear-bn folding because linear layer operates over the last dim, `(*, H_in)`, while bn layer operates over the channel dim, `(N, C_in, H, W)`.
To preserve the shapes of the original linear weight and bias in linear-bn folding, check that linear `out_features` matches bn `num_features`. If they don't match, bn `num_features` needs to be 1 to broadcast.
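A minimal sketch of the shape check described above (the helper name is illustrative, not the actual folding code):
```
import torch

def can_fold_linear_bn(linear: torch.nn.Linear, bn: torch.nn.BatchNorm1d) -> bool:
    # fused_w / fused_b only broadcast correctly if the feature dims line up,
    # or if bn has a single feature that broadcasts over linear's outputs.
    return linear.out_features == bn.num_features or bn.num_features == 1

print(can_fold_linear_bn(torch.nn.Linear(8, 4), torch.nn.BatchNorm1d(4)))  # True
print(can_fold_linear_bn(torch.nn.Linear(8, 4), torch.nn.BatchNorm1d(2)))  # False
```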
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
The former is only available on MacOS 14+, but at least on older MacOS versions it would raise an exception rather than returning a non-conjugated tensor.
This is a preliminary step for enabling FFT ops (without it, `ifft` would never work).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
Partially addresses https://github.com/pytorch/pytorch/issues/118785
This diff fixes three things:
1. Add get_function to FunctoolsPartialVariable; note that it will be available only if all args are constant, otherwise it would throw unimplemented in the call to asPythonConstant.
2. NamedTupleVariable takes args dispatched, not as a list, e.g. NamedTuple(a, b, c) vs NamedTuple([a, b, c]); fix that by specializing asProxy.
3. A call to create_arg from within create_proxy changes a Python NamedTuple to a function call node without associating an example value! Updated get_fake_values_from_nodes to handle such a case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119435
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314
I accidentally disabled this without realizing it. It turns out that
PYTORCH_TEST_WITH_INDUCTOR=1 implies PYTORCH_TEST_WITH_DYNAMO=1, which
activates skipIfTorchDynamo decorators.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119693
Approved by: https://github.com/bdhirsh
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.
Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
return grad_q, grad_k, grad_v
```
But (I think?) the logic in the sdpa backward impl was a typo.
I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).
A minimal repro that I made looks like this:
```
import torch
# in this repro, "grad_out" and "value" are transposed tensors,
# but "query" and "key" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)
out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)
from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)
# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())
# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())
# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.
For example, this test fails on M1 Macs because QNNPACK is not supported there:
```
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
```
as described in more detail in issue #88613.
After the PR, the test passes successfully:
```
[----------] 1 test from ConstantPropagation
[ RUN ] ConstantPropagation.CustomClassesCanBePropagated
[ OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)
```
Fixes #88613
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.
The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.
This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case: if you had a FakeTensor contained inside a TorchBind object, there was no obvious place to store the metadata for it. This actually happens: TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.
This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
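A heavily simplified sketch of the "patch the reducer" idea (the metadata record and helper names are illustrative, not the actual export serializer):
```
import copyreg
from torch._subclasses.fake_tensor import FakeTensor

def _reconstruct_fake(meta):
    # The real flow re-fakifies from the serialized metadata; here we just return it.
    return meta

def _reduce_fake_tensor(t: FakeTensor):
    # Serialize a standalone copy of the metadata instead of the tensor payload.
    meta = {"shape": tuple(t.shape), "dtype": t.dtype, "device": str(t.fake_device)}
    return (_reconstruct_fake, (meta,))

# Registered only for the duration of serialization in the real code.
copyreg.pickle(FakeTensor, _reduce_fake_tensor)
```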
As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).
Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
Summary:
This PR tries to resolve issue #119215.
Basically, process group shutdown (and hence ncclCommAbort) is called in the destroy_process_group APIs for the corresponding PGs, and in the destructor of ProcessGroup we avoid calling abort/ncclCommAbort. Instead, the destructor just checks whether the user has already explicitly called destroy_process_group. If not, the destructor logs a warning and encourages/expects users to do so to clean up the PGs' resources.
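The cleanup pattern this encourages, as a runnable single-process example (standard `torch.distributed` APIs; the address/port are arbitrary):
```
import torch.distributed as dist

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
# ... collectives / training would go here ...
# Explicitly destroy the PG so the ProcessGroup destructor has nothing to warn about.
dist.destroy_process_group()
```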
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
25 min -> 17 + 13 min, which is still not as fast as I want it to be but I'll take it
Lintrunner provides some parallelism by default, but it's not perfect
Reducing fetch-depth from all to 1 further reduces time by ~2-3 minutes
From the non-clang job's logs:
```
2024-02-09T22:05:39.5297616Z Requirement already satisfied: PyYAML==6.0 in /opt/conda/lib/python3.11/site-packages (6.0)
2024-02-09T22:12:23.6164708Z Collecting black==23.12.1
```
I don't know why this part takes so long, maybe it's just buffering? Clang version doesn't show this issue
See 5a750c8035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119575
Approved by: https://github.com/huydhn, https://github.com/malfet
Summary: In dynamo tracing, `index()`'s implementation currently has the default begin index as `0` and the default end index as `-1`, which means that by default we're dropping the last element. Rather, we should be using `None`, which will ensure that the last element is also checked.
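Plain-Python illustration of the semantics (this mirrors why `-1` drops the last element while `None` does not; it is not dynamo's internal code):
```
xs = [1, 2, 3]
print(xs[0:-1])    # [1, 2]     -- an end index of -1 excludes the last element
print(xs[0:None])  # [1, 2, 3]  -- None means "through the end"
```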
Test Plan: CI
Differential Revision: D53392287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119151
Approved by: https://github.com/yanboliang
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.
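Rough conceptual sketch of the decomposition (not Inductor's actual pass): `out = auto_functionalized(mutable_op, x=x)` becomes a clone of each mutated input followed by the original in-place op, illustrated here by hand with `relu_`:
```
import torch

x = torch.randn(4)
x_clone = x.clone()        # clone inserted by the decomposition
torch.relu_(x_clone)       # the original mutating op, applied to the clone
out = x_clone              # the "functional" result; the caller's x is untouched
print(torch.equal(out, torch.relu(x)))  # True
```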
The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (this output is a List), but the decomposition
ends up returning the unpacked List (e.g. it may return two tensors).
Previously, there was an assertion that this was not the case; I fixed
up `replace_with_graph` to handle this.
Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen
Summary: As we're growing the user surface of torch.export, we'd like to understand better how people are using our APIs. It's also possible to analyze the usages based on static analysis, but due to the fact that there could be many creative ways to call things in Python, I think just building some logging infra will benefit us in the short term and gain us some insights.
Test Plan:
buck test caffe2/test:test_export
Reviewed By: tugsbayasgalan
Differential Revision: D53618220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119585
Approved by: https://github.com/avikchaudhuri
Reason:
Consumers of ExportedProgram might choose to further lower exported_program.graph_module to something else.
Then, they will need to set up the calling convention to call it.
This refactor concentrates these calling conventions in one place so they can be reused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119513
Approved by: https://github.com/zhxchen17
This PR makes it so that ops is no longer a dict of RET => OP; rather, it is now RET => List[OP], since multiple OPs can now return the same RET. In real execution, only one of these OPs will be executed, so there is no need to worry about renaming. For analysis, we pessimistically assume any one of them could be executed (which is safest for analysis purposes).
Example TTIRs that can now be handled:
```
scf.if %13 {
%14 = tt.get_program_id y : i32 loc(#loc13)
%c0_i32_1 = arith.constant 0 : i32 loc(#loc14)
%15 = arith.cmpi eq, %14, %c0_i32_1 : i32 loc(#loc14)
scf.if %15 {
%16 = arith.addf %8, %11 : tensor<4xf32> loc(#loc16)
%17 = tt.splat %arg2 : (!tt.ptr<f32, 1>) -> tensor<4x!tt.ptr<f32, 1>> loc(#loc17)
%18 = tt.addptr %17, %4 : tensor<4x!tt.ptr<f32, 1>>, tensor<4xi32> loc(#loc17)
tt.store %18, %16, %5 {cache = 1 : i32, evict = 1 : i32} : tensor<4xf32> loc(#loc18)
} else {
} loc(#loc15)
} else {
} loc(#loc12)
```
and
```
%14 = scf.if %13 -> (tensor<4xf32>) {
%17 = arith.addf %8, %11 : tensor<4xf32> loc(#loc13)
scf.yield %17 : tensor<4xf32> loc(#loc13)
} else {
%17 = arith.mulf %8, %11 : tensor<4xf32> loc(#loc14)
scf.yield %17 : tensor<4xf32> loc(#loc14)
} loc(#loc12)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119664
Approved by: https://github.com/aakhundov
Otherwise, at least on MacOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MTIAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~CUDAHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~MPSHooksInterface() = default;
^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
```
Likely introduced by https://github.com/pytorch/pytorch/pull/119329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.
In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.
Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).
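Illustrative shape of that refactor (fields beyond those named above are guesses, not the actual `torch/_inductor/codegen/common.py` definition):
```
from dataclasses import dataclass
import sympy
import torch

@dataclass
class TensorArg:
    name: str                              # argument name in the kernel signature
    buffer: str                            # underlying ir.Buffer name (arg.data.get_name() for ReinterpretView)
    dtype: torch.dtype
    offset: sympy.Expr = sympy.Integer(0)  # replaces the old check_alignment flag
```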
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
It's unnecessary and inefficient to create a `dict` from list indices to list values just to check if particular `idx` exists there. This way leads to `O(N)` time and space complexity whereas using `list` directly is `O(1)` time and space complexity.
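The difference in a nutshell (illustrative, not the actual call site):
```
items = ["a", "b", "c"]
idx = 1

# Before: build a dict from indices to values just to test membership -- O(N) time and space.
lookup = dict(enumerate(items))
print(idx in lookup)            # True

# After: use the list directly -- O(1) time and space.
print(0 <= idx < len(items))    # True
```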
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118011
Approved by: https://github.com/Skylion007
There is no need to make a `frame_summary_stack` copy in case it's not modified. Proposed change uses copy-on-write functional approach that is easy to understand and is more efficient in case `self.loc_in_frame` is `None`
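A minimal sketch of the copy-on-write pattern described above (attribute names follow the description; the surrounding class is illustrative):
```
class _Summary:
    def __init__(self, frame_summary_stack, loc_in_frame=None):
        self.frame_summary_stack = frame_summary_stack
        self.loc_in_frame = loc_in_frame

    def stack(self):
        stack = self.frame_summary_stack
        if self.loc_in_frame is not None:
            # Only pay for a new list when there is actually something to append.
            stack = stack + [self.loc_in_frame]
        return stack

print(_Summary(["f1", "f2"]).stack())            # ['f1', 'f2'] -- no copy made
print(_Summary(["f1", "f2"], "line 3").stack())  # ['f1', 'f2', 'line 3']
```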
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119115
Approved by: https://github.com/Skylion007
This diff adds few improvements:
* Parsing for multiple return value: `tt.return %1, %arg0`
* Parsing for assignment for multiple values: `%1:2` means %1 has two values
* Parsing for usage of a value with multiple values: `%1#0` means 0th index of %1
* Fixes a bug in memo-cycle detection when multiple tests are executed back to back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119615
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.
# Design
Currently, it primarily offers stream-related APIs, including
- `torch.xpu.StreamContext`
- `torch.xpu.current_stream`
- `torch.xpu.set_stream`
- `torch.xpu.synchronize`
- `torch._C._xpu_getCurrentRawStream`
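A brief usage sketch of the APIs listed above (assumes a PyTorch build with XPU support and at least one XPU device; `torch.xpu.is_available` is assumed here for the guard):
```
import torch

if torch.xpu.is_available():
    s = torch.xpu.current_stream()
    with torch.xpu.StreamContext(s):   # make `s` the current stream inside the block
        x = torch.ones(4, device="xpu") * 2
    torch.xpu.synchronize()            # block until all queued XPU work is done
    print(x.cpu())
```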
# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`.
The differences with CUDA:
no default and external stream in XPU and lack of below APIs:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `toch.cuda.is_current_stream_capturing`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
This PR makes a couple of improvements to non-strict to bring it closer to strict. (This lets us remove some expected failures from test_export.)
1. Support constant arguments (easy).
2. Support keyword arguments. This forces us to add kwargs to `aot_export_module`. Indeed there is no way to make this work otherwise, because some arguments in a function signature can be keyword-only and thus cannot be simulated by positional arguments alone. Adding kwargs to `aot_export_module` turns out to be fairly routine, but there is a bit of an unsatisfactory fork between how it is called by strict and non-strict: because strict calls it on a graph module, kwargs must be converted to positional arguments. So kwargs in `aot_export_module` really only come into play in non-strict.
Differential Revision: D53600977
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119529
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Summary:
There was a bug in the module name filter for modules that had an underscore
already in them, as it was replaced with a "dot" notation.
This is because it was thought that underscores always meant a module separator,
but this isn't the case for modules whose name contains an underscore.
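Plain-Python illustration of the described bug (names are made up; the real filter lives in the quantization module-name matching code):
```
module_name = "encoder.layer_norm"             # a module whose own name contains "_"
wrongly_normalized = module_name.replace("_", ".")
print(wrongly_normalized)                      # "encoder.layer.norm" -- no such module,
                                               # so the filter no longer matches the intended module
```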
Test Plan:
Added a unit test. Before this change, that test failed (due to applying the wrong
qscheme). Now it passes.
Differential Revision: D53502771
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119344
Approved by: https://github.com/jerryzh168
Apply the modularization pass when exporting an exported program. The only two things that need to be taken care of are (1) the extra call stack generated by `torch.export.export` and (2) lifted placeholders have a call stack (unlike original placeholders).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119498
Approved by: https://github.com/thiagocrepaldi
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`
This method can be overridden using `__torch_function__`.
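A rough sketch of the load path described above (the surrounding loop is illustrative; `module_load` and `swap_tensors` are the APIs named here):
```
import torch

def load_with_swap(module: torch.nn.Module, state_dict: dict) -> None:
    for key, param in module.named_parameters():
        new_value = param.module_load(state_dict[key])  # transform the state-dict value for this param
        torch.utils.swap_tensors(param, new_value)      # then swap it in wholesale
```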
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:
* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
FallbackKernel wasn't handling mutable ops correctly: it would not report
them in get_mutation_names or get_alias_names. This would lead to silent
incorrectness -- Inductor would incorrectly reorder the mutable op with other
mutable ops.
This PR fixes that:
- we only support mutable operations that are "auto_functionalizable".
That is, they mutate inputs and do not return aliases of any inputs.
- Following the Triton kernel work, any mutated inputs must be specified
in get_alias_names and processed via mark_node_as_mutating
- We also do some minor cleanup by killing dead code (FallbackKernel no
longer processes OpOverloadPacket) and adding some handling around
HOPs.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118649
Approved by: https://github.com/eellison, https://github.com/oulgen
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
This PR replaces the `_unsafe_preserve_version_counters` context with a simple `torch.no_grad()` context instead. This decreases CPU overhead from (1 context enter/exit + an `N`-iteration loop over tensors) to just (1 context enter/exit).
This PR also removes a `torch.no_grad()` from `init_unsharded_param` as it helps compiling but does not affect eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119550
Approved by: https://github.com/Skylion007
This is going to fix a legacy issue like:
```
torch._dynamo.export(torch.ops.aten.scaled_dot_product_attention, ...)(*inputs,)
```
This is not supported any more; now the top-level `torch.export` only supports `nn.Module`, but there are still some tests using the internal APIs, which caused the `trace_rules.check` assertion error. This PR is going to mitigate such cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119528
Approved by: https://github.com/ydwu4
Due to PR_WINDOW, if the magic string exists in the body but the PR was not updated recently, the query wouldn't find it and would delete the branch. Instead, query separately for branches with the no-delete-branch label, which I created recently.
Might as well query for branches with open PRs while we're at it, so PRs with the stale label won't get their branches deleted either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
When using the Cutlass backend, the compilation of CUDA source files can totally dominate the runtime required for the benchmarking done as part of autotuning.
This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and a possible on-disk sccache).
It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
Add a flag setting that controls a threshold of guards involving a symbol, after which we force a symbol to be specialized. The roll out plan is to enable this on OSS but not fbcode, and then roll out to fbcode after we get some telemetry from the previous PR.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119347
Approved by: https://github.com/lezcano
```
Takes in a function which has been printed with print_readable() and constructs kwargs to run it.
Currently only handles Tensor inputs and a graph module which might have tensor constants.
Example:
Consider a function `forward` defined as follows:
>>> def forward(self, primals_1: "f32[1001, 6]"):
... _tensor_constant0: "i64[4190]" = self._tensor_constant0
... # Further implementation
>>> kwargs = aot_graph_input_parser(forward)
>>> forward(**kwargs)
"""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119409
Approved by: https://github.com/shunting314
This PR adds a new type of triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans. Because of this limitation, I have to be able to block fusions of split-scan operations with reductions, so I chose to add a new `ir.SplitScan` node which is identical but allows for differentiation in the scheduler.
The split scan kernel is also the first to require an additional workspace buffer, which is used to communicate between CUDA blocks. This is slightly tricky as the exact scratch-space requirement isn't known until the grid size is calculated. Here I work around the issue by setting a minimum rblock size and always allocating for the maximum possible grid size for a given input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set
This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension
The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension
These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.
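Illustrative grouping of the three flags (not the actual range-tree class in Inductor; shown only to make the combinations concrete):
```
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeTreeProps:
    prefix: str                # "x", "y", "r", ...
    is_loop: bool              # True for the "r" dim of a non-persistent reduction
    tensor_dim: Optional[int]  # index of the triton tensor dimension, if any
    grid_dim: Optional[int]    # index of the triton grid dimension, if any

# e.g. the proposed split-scan kernel: "r" is not a loop, yet it maps to a grid dimension.
split_scan_r = RangeTreeProps(prefix="r", is_loop=False, tensor_dim=0, grid_dim=0)
```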
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-83b319, line 51; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 51; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to errors
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/inductor/test_torchinductor.py -k test_bfloat16_to_int16_cuda
```
Fixed a test failure that uses bfloat16 on pre-SM80 GPUs (V100 is where the test failure is seen for this test).
See also #113384
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118449
Approved by: https://github.com/eqy, https://github.com/peterbell10
Summary:
Seems like `kwargs` is already supported in `_infer_argument`, so we don't need the extra assertion `len(kwargs) == 0`.
This optimization ensures compatibility with torch.compile() for LazyModules with kwargs inputs, preventing graph breaks.
Test Plan: Unit test and CI
Differential Revision: D53558778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119445
Approved by: https://github.com/yanboliang
In some cases where we have TORCH_CHECK in loops, it may cause the host compiler to spend hours optimizing the run_impl function. This PR mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK, where we force the underlying assert function to be noinline.
If forcing noinline causes any serious perf regression, we could either add an option to turn noinline on/off, or add an option to just turn AOTI_CHECK into a no-op, similar to the `assert` macro from cassert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).
This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.
Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.
Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR could unify the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/jansel
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
It should usually be safe to run pointwise binary ops with >2 inputs. e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors, and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
Previously in non-strict mode we would source a FakeTensorMode from existing tensors if available.
It turns out this is problematic, as it means we can't directly control the behavior of this FakeTensorMode. For example, if the user-provided FakeTensorMode does not set `allow_non_fake_inputs=True`, then we get into trouble with constant tensors, etc.
At the moment, we still have to explicitly re-fakify the module state. @ezyang has recommended against this, but it's necessary because `create_aot_dispatcher_function` calls `detect_fake_mode` on all the inputs, which will error if not all the FakeTensors are on the same mode. We should straighten this out, but I'm leaving that for the future.
Differential Revision: [D53559043](https://our.internmc.facebook.com/intern/diff/D53559043/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119446
Approved by: https://github.com/ezyang, https://github.com/zhxchen17
Summary:
This wires the eager-mode operation to the Vulkan shader. We only cover the case where both inputs are Tensor type, which is on par with the existing operators: add, sub, mul, div, floor_div.
It doesn't seem like we can cover [any of the other 8 cases](https://www.internalfb.com/code/fbsource/[e45c04564445b5e67ebb61e6ba53995729686526]/xplat/caffe2/torch/distributed/_tensor/ops/pointwise_ops.py?lines=310-317) right now. We categorize them below and explain what's missing for each.
## Category 1
The other 2 of the 3 "standard" cases require one of the values to be a scalar,
```
z = torch.pow(x, y)
```
```
aten.pow.Scalar,
aten.pow.Tensor_Scalar,
aten.pow.Tensor_Tensor,
```
which is not currently supported.
```
F 00:00:01.746228 executorch:aten_bridge.cpp:21] In function check_tensor_meta(), assert failed (b.sizes().data() != nullptr): ETensor must have valid sizes array
```
## Category 2
IIUC, these operators require an out argument in the declaration. However, when they are traced they collapse into Category 1, e.g., we obtain `aten.pow.Tensor_Tensor`, not `aten.pow.Tensor_Tensor_out`.
This appears in line with current PT-Vulkan, which only [implements the other two categories](https://www.internalfb.com/code/fbsource/[f148c22604b8e409696fd64f814cda89d091fe7a]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/BinaryOp.cpp?lines=533-558).
```
torch.pow(x, y, out=z)
```
```
aten.pow.Scalar_out,
aten.pow.Tensor_Scalar_out,
aten.pow.Tensor_Tensor_out,
```
## Category 3
IIUC, in-place operators are written like this:
```
x.pow_(y)
```
```
aten.pow_.Scalar,
aten.pow_.Tensor,
```
They are not currently supported.
```
File "/data/users/jorgep31415/fbsource/buck-out/v2/gen/fbcode/b007eb344207ad7d/executorch/backends/vulkan/test/__test_vulkan_delegate__/test_vulkan_delegate#link-tree/torch/_export/verifier.py", line 188, in _check_valid_op
raise SpecViolationError(
torch._export.verifier.SpecViolationError: operator 'aten.copy_.default' is not functional
```
Test Plan:
```
[jorgep31415@devvm15882.vll0 /data/users/jorgep31415/fbsource (fd1ed5f81)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate -- test_vulkan_backend_pow
File changed: fbcode//executorch/backends/vulkan/vulkan_preprocess.py
Buck UI: https://www.internalfb.com/buck2/7f9ec9e5-cbac-4618-b8ad-d94d10bb50ff
Test UI: https://www.internalfb.com/intern/testinfra/testrun/562950306906309
Network: Up: 3.2KiB Down: 0B (reSessionID-ea5af789-c131-4170-ba20-5c5c9718276b)
Jobs completed: 7. Time elapsed: 48.5s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D53547865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119423
Approved by: https://github.com/SS-JIA, https://github.com/malfet
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS yet to support double precision types, so downcast them to floats
- Also add `cf`(for complex float) and `ch`(for complex half) to MPSScalar value union
- Fix complex scalar-to-view promotion by introducing a `legacy_complex_as_view` helper function that casts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`
Fixes https://github.com/pytorch/pytorch/issues/119088
Test plan: CI (have quite a lot of tests, see new unexpected successes) + `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
Summary:
With a compiled PyTorch module, in execution_trace_observer.cpp, the function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which causes a deadlock on ob.g_mutex.
This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern.
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53533253
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
This pull request aims to complete most of the support for vectorizing int32 and int64 data types, except for indirect indexing and masks. Basic data type support for uint32 and uint64 is also added, but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which, the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we'd better move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h, will be addressed in follow-up PRs.
3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use full of the lanes and map bf16 vector to `VectorizedN<float,2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but not of high priority due to fewer usages.
Next steps:
- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
Fix: #115792
This PR implements 2 virtual functions of `TensorImpl` that are called when setting the
`tensor.data`:
- `shallow_copy_from`: which calls `copy_tensor_metadata`; and
- `copy_tensor_metadata`: which copies all `FunctionalTensorWrapper` metadata and ~calls
`dest->value_.set_data(src->value_)`~ assigns `dest->value_ = src->value_`, so as to copy also the inner tensor using the same
method
Before this PR, the inner tensor of a `FunctionalTensorWrapper` was being ignored.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118783
Approved by: https://github.com/bdhirsh
Fixes#119198.
This PR makes Dynamo inline `__iter__` of a user-defined object instead of creating a graph break. Also added a new test, which shows:
1. the loop is unrolled
2. the length of the loop is guarded when inlining `__iter__`
```python
class Mod:
    def __init__(self):
        self.a = [torch.randn(2, 2), torch.randn(2, 2)]

    def __iter__(self):
        return iter(self.a)

def f(mod):
    ret = []
    for x in mod:
        ret.append(x + 1)
    return ret
```
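Continuing the snippet above, a hedged sketch of how it might be exercised after this PR (the `eager` backend and `fullgraph=True` are assumptions, mirroring other tests in these notes):
```python
import torch

# Assumes Mod and f from the snippet above are already defined.
compiled_f = torch.compile(f, backend="eager", fullgraph=True)
out = compiled_f(Mod())  # __iter__ is inlined and the loop over mod is unrolled
```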
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119243
Approved by: https://github.com/jansel
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
Summary: This diff optimizes the .item() call by backing the scalar value storage with pinned memory, so we don't create an implicit synchronization with the libcuda library.
Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)
# .item() Benchmark
1 thread 50k iterations
consistent ~2-3% improvements
With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds
w/o pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
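For reference, a minimal sketch of the kind of micro-benchmark reported above (assumed, not the exact script used): time 50k `.item()` calls on a single CUDA scalar.
```python
import time

import torch

x = torch.ones(1, device="cuda")
torch.cuda.synchronize()
start = time.time()
for _ in range(50_000):
    x.item()  # each call copies a single scalar from device to host
torch.cuda.synchronize()
print(f"item() took {time.time() - start} seconds")
```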
Differential Revision: D53431148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.
Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
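A small sanity check of the pointwise identity relied on here (illustrative only): `log(|g|)` matches `log(g)` for positive entries and `log(-g)` for negative ones.
```python
import torch

g = torch.tensor([-2.0, 3.0, -0.5])
combined = g.abs().log()
manual = torch.where(g > 0, g.log(), (-g).log())
assert torch.allclose(combined, manual)
```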
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second runtime component we would like to upstream is `Stream` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.
# Design
An Intel GPU stream is a wrapper of a sycl queue, which schedules kernels on a sycl device. In our design, we will maintain a sycl queue pool containing 32 queues per priority per device, and when a queue is requested, one of these queues is returned round-robin. The corresponding C++ files related to `Device` will be placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, like
- `XPUStream getStreamFromPool`
- `XPUStream getCurrentXPUStream`
- `void setCurrentXPUStream`
- `void device_synchronize`
# Additional Context
In our plan, 2 PRs should be submitted to PyTorch for `Stream`:
1. for c10
2. for python frontend.
The differences with CUDA:
XPU has no default or external stream, and lacks the APIs below:
- `getDefaultCUDAStream`
- `getStreamFromExternal`
For CUDA, `cuda::device_synchronize` can sync all streams on the device, but for XPU, `xpu::sync_streams_on_device` only syncs all reserved streams on the device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117611
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
To avoid the following error:
```
2024-02-07T12:49:51.8306390Z ld: warning: dylib (/Users/runner/work/_temp/anaconda/envs/wheel_py38/lib/libomp.dylib) was built for newer macOS version (11.1) than being linked (11.0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119373
Approved by: https://github.com/huydhn
This PR adds explicit backward prefetching to overlap communication and computation in backward (namely, needed for `reshard_after_forward=True` or `reshard_after_forward: int`). We do this by recording the post-forward order and using its reverse to approximate the backward order.
This works for the typical 1 forward / 1 backward training. However, for more complex schedules, this can run into some gaps:
- We need to know the _true end of backward_.
- At the true end of backward, we can clear our recorded post-forward order and pre-backward hook state, and we should wait on gradient reductions.
- There is no easy way to know whether the current backward marks the true end of backward. Therefore, we introduce an API for the user to set this: `fsdp_module.set_is_last_backward(bool)`. For example, for pipeline parallelism's DFS cooldown backward, we can call `fsdp_module.set_is_last_backward(is_last_microbatch)` (a minimal sketch follows this list).
- When the user runs backward through only part of the model, our reverse-post-forward-order heuristic risks _mistargeted prefetches_ for unused modules, which would mean the module's parameters are all-gathered and not freed until the end of backward.
- To err on the side of less memory usage (but no overlap), this PR introduces logic to check whether a module will need its unshard in the current backward (by recording the module's `forward` outputs' `grad_fn`s and querying the autograd engine).
- Note that there may be _no_ overlap in backward for some parts due to no prefetching.
- Note further that when running multiple backwards, if the user does not use `set_is_last_backward`, we may not be able to provide a meaningful error message, as the pre-backward hook could be erroneously cleared on the 1st backward.
- In the future, we may expose more APIs from the autograd engine (similar to `_current_graph_task_execution_order`) to make the prefetching exact. (Currently, `_current_graph_task_execution_order` requires the `with torch.autograd.set_multithreading_enabled(False)`, which is too hard of a constraint as we cannot easily modify users' training loops. We can replace the multi-threading check with a device check. Moreover, in the partial backward case in this PR's unit test, I still hit an [internal assertion](b816760a2f/torch/csrc/autograd/engine.cpp (L476)), so some follow-up is required.)
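A hedged sketch of the microbatch pattern described above; the FSDP wrapper is replaced by a stand-in module so the snippet runs without a distributed setup, and `set_is_last_backward` here is only a mock of the new API.
```python
import torch
import torch.nn as nn

class FakeFSDPModule(nn.Module):
    """Stand-in for an FSDP-wrapped module exposing set_is_last_backward."""

    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)
        self._is_last_backward = True

    def set_is_last_backward(self, is_last: bool) -> None:
        # The real API tells FSDP when it may clear prefetch state and
        # wait on gradient reductions.
        self._is_last_backward = is_last

    def forward(self, x):
        return self.lin(x)

fsdp_module = FakeFSDPModule()
microbatches = [torch.randn(2, 4) for _ in range(4)]
for i, microbatch in enumerate(microbatches):
    fsdp_module.set_is_last_backward(i == len(microbatches) - 1)
    fsdp_module(microbatch).sum().backward()
```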
<details>
<summary> Old Discussion </summary>
For discussion:
- The PR includes a counter `expected_backward_unshard_count` to mitigate mistargeted prefetches in backward. However, it can be seen as a necessary but not sufficient solution.
- If a module's outputs do not require gradient, then we certainly do not need to unshard the module in backward.
- However, if a module's outputs do require gradient, then we still may not need to unshard the module for _this_ backward (e.g. if the module did not contribute to `loss` for the current `loss.backward()`).
- This counter will only address the first case but not the second. If we want to address the second, then we may need more info from the autograd engine.
- For now, I did not include any unit test to cover these behaviors, as I do not have a good example yet.
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118118
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118017
**Summary**
The reducer of `DistributedDataParallel` is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor.
**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.
**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.
The bucketing is done in a separate PR.
Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
Partially fixes https://github.com/pytorch/pytorch/issues/105077
Repro:
```python
import tempfile
import torch
from torch._subclasses import fake_tensor
class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```
Error:
```bash
Traceback (most recent call last):
File "issue_gh_torch_105077.py", line 22, in <module>
state_dict = torch.load(state_dict_file.name)
File "/opt/pytorch/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/opt/pytorch/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
_, new_kwargs = normalize_function(
File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
torch_op_schemas = get_signature_for_torch_op(target)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
arg_type = _torchscript_type_to_python_type(arg.type)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
return eval(ts_type.annotation_str, _type_eval_globals)
File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```
This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context, which skips the fake-mode dispatcher for it and thus creates a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization.
# Design
This PR primarily offers support for multi-processing via lazy initialization. We lazily initialize our runtime, deferring XPU initialization until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, which is a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
## This diff
This optimization reduces calls to `texelFetch(uKernel, ...)` by 4.
We borrow MatMul's work to do the re-packing:
https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50
## Future optimizations
We are already batching reads from input/weight tensors, and writes to output tensor.
Here are other ideas, which I won't pursue for now. (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For weights, the length dimension is definitely correct here, but input/outputs could potentially leverage the length dimensions too. However, `stride != 1` would complicate this optimization.
2. **Batch an optimal number of reads/writes.** Instead of default-ing to 4 elements (since that corresponds to 1 texel), consider more elements such as MatMul's 4x4 texel tile.
3. **Obscure shader compiler optimizations.** Since MatMul seemed to benefit from several seemingly equivalent ways to write code.
Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22
Removes raising an error if a device_mesh has a parent.
The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.
I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/Skylion007
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view_mem_leak
# Before
RuntimeError:
Found following user inputs located at [0] are mutated. This is currently banned in the aot_export workflow.
If you need this functionality, please file a github issue.
fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=True, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutates_storage_metadata=False, requires_grad=False, mutation_type=<MutationType.MUTATED_OUT_GRAPH: 3>),...)
# Now
Ran 6 tests in 13.851s
OK (skipped=4)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119219
Approved by: https://github.com/oulgen
Otherwise emitting TD stats will fail with the following warning:
```
Emiting td_test_failure_stats
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/testing/target_determination/heuristics/edited_by_pr.py:37: UserWarning: Can't query changed test files due to Command '['git', 'merge-base', 'origin/main', 'HEAD']' returned non-zero exit status 1.
warn(f"Can't query changed test files due to {e}")
```
Test plan: Observe that MPS jobs finish without those warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119401
Approved by: https://github.com/atalman, https://github.com/huydhn
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102
In the example, the process stopped producing Python logs after 17:20:21 until the job finished
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, `UnicodeDecodeError` was thrown with no call stack.
## Fix
Use `errors="replace"` to avoid throwing an exception when `UnicodeDecodeError` would otherwise happen.
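A small illustration of that decoding behavior (generic Python, not the tail_logger code itself): with `errors="replace"`, undecodable bytes become replacement characters instead of raising.
```python
data = b"Progress: 118 batches \xff\xfe done"
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode fails:", e)
print(data.decode("utf-8", errors="replace"))  # prints with U+FFFD in place of bad bytes
```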
Test Plan: f528854819
Differential Revision: D53483644
Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
This PR adds the `reshard_after_forward: Union[bool, int]` arg and a `reshard()` method. The `reshard_after_forward` argument trades off communication and memory.
- `reshard_after_forward=True`: reshard parameters after forward; unshard (all-gather) in backward
- `reshard_after_forward=False`: no reshard of parameters after forward; no unshard (all-gather) in backward
- `reshard_after_forward: int`: reshard parameters to a smaller world size; unshard (all-gather) over small world size in backward
In comparison with DeepSpeed and existing FSDP:
- `reshard_after_forward=True` == `FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `SHARD_GRAD_OP` == ZeRO-2
- `reshard_after_forward=8` == ZeRO++
ZeRO-1 is `reshard_after_forward=False` without gradient reduction (implemented in a later PR). If we need gradient reduction on an iteration, then ZeRO-2 supersedes ZeRO-1.
We prefer a simple state transition between `SHARDED` / `SHARDED_POST_FORWARD` and `UNSHARDED`, where the state directly defines what tensors are registered to the module. In particular, we _do not_ have a state where the sharded parameters are registered but the unsharded parameters are still in GPU memory. This greatly simplifies our state transitions, but it means that parameters may be non-intuitively registered to the module (e.g. if only the root does not reshard after forward, then the root will be the only module without sharded parameters registered). To address this, we introduce a simple `reshard()` method that can force-reshard the parameters. This makes sense to me because the typical case does not care about the registered parameters after forward (in fact, for existing FSDP with `use_orig_params=False`, the unsharded parameters are still registered and are dangling tensors without storage.)
I plan to expose a complementary `unshard(async_op: bool = True)` method in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118017
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary:
This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to remove `fold_quantize` flag from
`convert_pt2e`
Test Plan: CI
Differential Revision: D53247301
BC Breaking Note:
The flag `fold_quantize` now defaults to True in `convert_pt2e`, so we fold the quantize op into the weight by default and users will see a model size reduction by default after PT2E quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)
```
2.3
```
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.
From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**
***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.
If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.
**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the `swap_tensors` path as of now**
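A hedged sketch of toggling the new future flag; the `.to()` conversion shown is just an arbitrary example of a module conversion.
```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)
try:
    m = nn.Linear(2, 2)
    m.to(torch.float64)  # _apply now swaps parameter tensors instead of setting .data
    print(torch.__future__.get_swap_module_params_on_conversion())  # True
finally:
    torch.__future__.set_swap_module_params_on_conversion(False)  # restore the default
```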
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
So, this is a little awkward, so I don't mind more thoughts on how best to do this.
Let's suppose that you have a graph break inside of an inlined function call. We are not actually going to print this graph break yet; instead, we are going to restart analysis so that we can run up until the inlined function call. When this happens, the only log message we ever get is the log to `graph_break` (seen here) reporting that a graph break has occurred.
In the current code, we don't print the fully formatted exception if you are only using `graph_breaks` logging. So the exception that induced the graph break has its traceback lost forever. For some classes of errors, esp., guard on data-dependent SymInt, this is quite bad.
With this change, we do print the traceback. On this sample program:
```
import torch
import torch._dynamo.config
torch._dynamo.config.capture_scalar_outputs = True
def g(x, y):
    y = x.item()
    if y < 3:
        return x + 2
    else:
        return x + 3

@torch.compile()
def f(x, y):
    y = y * y
    return g(x, y)
f(torch.tensor(4), torch.randn(4))
```
It looks like this:
```
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 878, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return guard_scalar(self.sym_num)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 414, in guard_scalar
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return guard_bool(a)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 663, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return a.node.guard_bool("", 0) # NB: uses Python backtrace
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/sym_node.py", line 366, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return fn(*args, **kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3670, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] concrete_val = self.size_hint(orig_expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3403, in size_hint
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] raise self._make_data_dependent_error(result_expr, expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] During handling of the above exception, another exception occurred:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return inner_fn(self, inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] self.call_function(fn, args, {})
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] self.push(fn.call_function(self, args, kwargs))
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 279, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return super().call_function(tx, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 87, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return tx.inline_user_function_return(
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2262, in inline_call
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return cls.inline_call_(parent, func, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2372, in inline_call_
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] tracer.run()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] and self.step()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] getattr(self, inst.opname)(inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 431, in inner
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] eval_result = value.evaluate_expr(self.output)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 880, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] raise UserError( # noqa: TRY200
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch._dynamo.exc.UserError: Consider annotating your code using torch._constrain_as_*(). It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] From user code at:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/b.py", line 16, in f
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] return g(x, y)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] File "/data/users/ezyang/b/pytorch/b.py", line 8, in g
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] if y < 3:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
```
The end of the log at the restarted computation can maybe be improved too. Right now it looks like this:
```
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 2 [UserFunctionVariable(), LazyVariableTracker(), TensorVariable()]
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='Consider annotating your code using torch._constrain_as_*(). It appears that you\'re trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3). For more information, run with TORCH_LOGS="+dynamic".\n\nFor more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example', user_stack=[<FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 16 in f>, <FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 8 in g>], graph_break=True)
```
An alternative to doing it this way, is I can make symbolic shapes print a warning log when guard on unbacked SymInt itself, so we don't have to worry about Dynamo generating the backtrace well. If, for the most part, the backtrace for other graph breaks is irrelevant, then this would seem to be a more expedient solution.
PTAL and submit your opinions.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119292
Approved by: https://github.com/yanboliang
Summary:
Previously, we were not fakifying module state explicitly in the nonstrict path.
This led to errors when modules were constructed under a fake mode, since the user-provided fake mode was clashing with the one that we had constructed internally to fakify the inputs.
This fixes things to use a single fake mode for everything.
As a side effect, this raised the question of how we ought to serialize state_dicts/constants that might be fake tensors. Naively calling torch.save understandably explodes, so this diff piggybacks on our infra for doing this on meta["val"]. I'm open to revising this; I have low confidence that it's the best way to do it.
Test Plan: unit tests
Differential Revision: D53484942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119297
Approved by: https://github.com/tugsbayasgalan
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!
With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s
OK
```
Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s
OK
```
The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
Since accumulate grad may steal the gradient's `c10::Storage`, we can't reuse the op; otherwise the gradient will get overwritten. From benchmarks, using the inductor's codegen'd _empty_strided_cpu/cuda and assigning to it has lower overhead than deep copying the gradient and reusing its buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817
Finally we have this PR to merge allow_in_graph/inline/skip trace rules into ```trace_rules.lookup_inner```, where we can define and look up trace rules at both the function level and the file level. Going forward, this is the central place where we define and consult the Dynamo trace rule for any function.
* ```trace_rules.lookup``` is the API that can return allow_in_graph, inline or skip.
* ```skipfiles.check``` is the API that can return inline or skip, since we have multiple places that only do the inline/skip check.
* I'll move ```skipfiles.check``` to ```trace_rules.check``` as one of the follow-ups.
* Both functions consult ```trace_rules.lookup_inner``` to get the tracing rule.
To avoid a single big PR, I left a few items as the follow-ups:
* Remove ```skipfiles.py``` and merge the code into ```trace_rules.py```.
* We do double check in ```symbolic_convert.check_inlineable```, will refactor and simplify it. We should only do inline/skip check before generating ```SkipFilesVariable``` and ```UserFunctionVariable```.
* Rename ```SkipFilesVariable``` as ```SkipFunctionVariable```, since we only handle functions.
* The inline/skip reasons are not logged for some cases, since the new lookup framework doesn't always return inline/skip reasons. I'll refactor loggings to record the inline/skip reason in next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118971
Approved by: https://github.com/jansel
Addresses issue https://github.com/pytorch/pytorch/issues/117383
The implementation exposes `--local-ranks-filter`, which filters, by rank, which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr)
## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, which is streamed to the console using `tail`. When --tee is specified, file logs will be unaffected and we will only filter the output to the console.
### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.
### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.
The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.
## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
__compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
------------- ------ ----------------------- --------- --------
placeholder l_x_ L_x_ () {}
call_function mul <built-in function mul> (l_x_, 5) {}
output output output ((mul,),) {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
[rank0]:------------- ------ ----------------------- --------- --------
[rank0]:placeholder l_x_ L_x_ () {}
[rank0]:call_function mul <built-in function mul> (l_x_, 5) {}
[rank0]:output output output ((mul,),) {}
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.
Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
The previous fix https://github.com/pytorch/pytorch/pull/118981 misses some corner cases. It works when both LazyGraphModule and compiled-autograd are enabled. But it fails with a FakeTensorMode mismatch error again if LazyGraphModule+CompiledAutograd+DynamicShape are all enabled. Note that disabling any of the three does not trigger the issue.
The reason why enabling DynamicShape cause the previous fix not working is, we will call the bw_compiler here before running the backward pass if there are symints saved for backward: 73f0fdea5b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L382)
The bw_compiler may cause extra GraphModule recompilation on the bw_module, which causes its forward method to become the lazy one again. The fix is just to delay applying the previous fix until after the potential extra call of the bw_compiler.
Repro on hf_Whisper:
```
CUDA_VISIBLE_DEVICES=1 time benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --disable-cudagraphs --accuracy --only hf_Whisper --repeat 1 --compiled-autograd --dynamic-batch-only
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119311
Approved by: https://github.com/xmfan, https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/118787
In the compiled function, calls to random() are replaced with a single call
to a function that generates all the random variables.
The random calls encountered during compilation used to be tracked inside a variable
stored inside the instruction translator. And when there are nested translators, the tracked
calls used to get lost when the inner instruction translator popped out.
This diff fixes that by moving the tracked calls to the output graph, which is shared across translators that are generating the same function.
More details about the issue and why this solution is picked are in the github issue above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119218
Approved by: https://github.com/jansel, https://github.com/anijain2305
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.
This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD and one thread for BWD, the contention is very low, so performance should NOT be a concern.
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53299183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
Summary: This commit adds a util for PT2E quantization users
to call `model.train()` and `model.eval()` without error.
Instead, these will automatically call the equivalent
`move_exported_model_to_train/eval` for the user, which only
switch behavior for special ops like dropout and batchnorm.
This enables users to onboard to the PT2E flow more easily.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_exported_model_train_eval
Reviewers: jerryzh168, tugsbayasgalan, zhxchen17
Subscribers: jerryzh168, tugsbayasgalan, zhxchen17, supriyar
Differential Revision: [D53426636](https://our.internmc.facebook.com/intern/diff/D53426636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119091
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
Fixes https://github.com/pytorch/pytorch/issues/117361
The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.
This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.
The infra pieces of this PR are:
* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
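As a concrete illustration, here is a hedged sketch of the Python-side guard mentioned above; the import path `torch.fx.experimental.symbolic_shapes.guard_size_oblivious` is an assumption on my part, not stated in this note.
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious  # assumed path

def maybe_squeeze_dim0(t):
    # Under a size-oblivious guard, a size-like unbacked SymInt is assumed to
    # be >= 2, so this branch can be resolved without a runtime guard.
    if guard_size_oblivious(t.size(0) != 1):
        return t
    return t.squeeze(0)

print(maybe_squeeze_dim0(torch.randn(3, 4)).shape)  # plain ints still work as usual
```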
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)
When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
Before the pr, we have a graph break for:
```python
def f():
    if torch.cuda.current_stream() is not None:
        return torch.randn(2, 2)

torch.compile(f, backend="eager", fullgraph=True)()
```
This PR supports comparison ops between StreamVariable and ConstantVariable by returning a constant.
It's safe to return a constant in this case because the StreamVariable is guarded by ID_MATCH when created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119199
Approved by: https://github.com/yifuwang, https://github.com/anijain2305, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/119238
Here's what it looks like now:
```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch
@torch.compile
def f(x):
return x * 2
f(torch.randn(2))
```
The eval_frame frame is intentionally present, since what happens is you run the torch.compile wrapper, and then you actually hit the user frame to be compiled.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
Fixes Issue #73792
This is a duplicate of pull request #73864. It's a small bugfix that should have happened a long time ago, but it didn't because I didn't actually follow up with the pull request after originally submitting. That's my bad. Trying to remedy the error.
This contains a fix to _pad_mixture_dimension, which intends to count the number of dimensions in its referent tensors, but accidentally counts the number of elements (and can thus end up creating tensors with potentially thousands of dimensions by mistake). Also contains a single test for the fixed behavior.
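A tiny illustration of the dimensions-versus-elements mix-up described above (not the actual `_pad_mixture_dimension` code):
```python
import torch

t = torch.randn(3, 4)
print(t.dim())    # 2  -> the number of dimensions, which was intended
print(t.numel())  # 12 -> the number of elements, which was counted by mistake
```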
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118947
Approved by: https://github.com/soulitzer
## Problem
A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`.
Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.
## Approach
Pass in wrapper when Inductor codegens grid with ints/sympy.Expr, so we can utilize wrapper functions, such as `codegen_shape_tuple()`.
Differential Revision: D53367012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
### Descriptions
According to flash attention v2, optimize softmax by dividing sum out of the KV inner loop.
### Performance
Stable Diffusion V2.1 on GNR
| Version | Kernel time before (s) | Kernel time after (s) | Speedup |
|---------|------------------------|-----------------------|---------|
| BF16    | 28.67                  | 23.55                 | 17.86%  |
| FP32    | 54.20                  | 49.47                 | 8.73%   |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg
- Added support for serializig the auto_functionalization op, which
required adding the functions `serialize_arbitrary_inputs` and
`serialize_arbitrary_outputs` which will serialize the inputs/outputs
without needing a schema, since HOOs do not have a schema.
- Added support for serializing user input mutations
- Added support for serializing operator inputs. They just get turned
into strings.
Differential Revision: [D53331039](https://our.internmc.facebook.com/intern/diff/D53331039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118810
Approved by: https://github.com/suo
For code like following:
```python
import torch
def f():
    a = {"a": torch.randn(2, 2)}
    a.clear()
    return a

torch.compile(f, backend="eager", fullgraph=True)()
```
We have a graph break before the pr:
```
torch._dynamo.exc.Unsupported: call_method ConstDictVariable() clear [] {}
```
Test Plan:
Added new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119197
Approved by: https://github.com/jansel, https://github.com/anijain2305
Differential Revision: D53398312
## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use the std::max function instead.
```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);
# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```
## Approach
Use C++ wrapper's expression printer to handle this conversion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
The NCCL backend requires CUDA (including devices) to be available. So don't use that backend by default if that isn't the case to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Fixes #117746
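A hedged, single-process sketch of the CPU-only scenario; the explicit gloo process-group setup here is only to keep the snippet self-contained and is not part of the fix.
```python
import os

import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # no GPUs, so no NCCL
mesh = init_device_mesh("cpu", (1,))
print(mesh)
dist.destroy_process_group()
```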
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work with each other initially.
Thanks to Jason for suggesting a neat way to integrate these two. cpp-wrapper does two-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. The multi-kernel Python code is not generated for the second pass since it should not be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, almost no ModuleInfos override this (so only `float32` and `float64` are tested).
This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.
Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
Currently, HSDP validates that all intra/inter node PGs are the same. This makes sense if you are only using HSDP with no other forms of parallelism and is a nice but not necessary sanity check.
However, if you want to mix HSDP with other forms, say tensor parallelism on the FFN of a transformer block, the intra/inter node PGs will be different for that layer. This check raises errors in this scenario, so we need to remove this assumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
Example https://github.com/pytorch/pytorch/actions/runs/7562281351/job/20592425611?pr=117079 (The code to delete branches isn't being run, it's just listing the branches it wants to delete)
Internal code: https://fburl.com/code/hdvvbfkj
The threshold for a branch with a PR is 30 days, regardless of whether the PR is merged (compared to 3 days if merged and 30 days if closed). The threshold for a branch without a PR is 1.5 years (same internally).
Threshold of ~400 queries to GitHub so it doesn't hit token usage limits. Currently this leads to about 350 branches deleted per run.
Only query for the last 90 days of updated PRs to reduce token usage, so if a branch has a PR but it was updated 90+ days ago, the script will think it doesn't have a PR and will wait for the 1.5-year branch-update check instead, regardless of whether the PR is open or closed.
I tested that it could delete my own branch and it worked.
labeled with test-config/crossref because I just want the smallest test config possible to reduce CI usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117079
Approved by: https://github.com/malfet
Type checking Python is a pain. Here are my learnings:
* The types for heavily polymorphic code are going to be verbose; no way around it. I originally hoped I could lean on polymorphism with a bounded TypeVar to compactly write signatures for many of the ValueRanges methods, but I ran into some mypy bugs I could not work around. Writing out all the types explicitly and using `@overload` liberally works pretty well, so I recommend people do that instead of trying to do fancy things (a minimal sketch follows this list).
* Sympy is missing annotations for assumptions, because they are all metaprogrammed. I don't really relish maintaining a typeshed for sympy, so I wrote a small mypy plugin to add them in.
* GADT-style refinement is... just not a good idea in practice. Mypy easily gets confused about whether a return value from a refined section is allowed for the outer return type. So many of these have been replaced with less informative implementation types and more informative external types via overloads. Hopefully this is good for use sites.
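A minimal sketch of the explicit-`@overload` style recommended above (a hypothetical wrapper, not the real ValueRanges API):
```python
from typing import Union, overload

import sympy
from sympy.logic.boolalg import Boolean

@overload
def negate(expr: sympy.Expr) -> sympy.Expr: ...
@overload
def negate(expr: Boolean) -> Boolean: ...
def negate(expr: Union[sympy.Expr, Boolean]) -> Union[sympy.Expr, Boolean]:
    # logical negation for booleans, arithmetic negation for expressions;
    # callers see a precise return type per input type without a TypeVar
    return sympy.Not(expr) if isinstance(expr, Boolean) else -expr

x = sympy.Symbol("x")
print(negate(x), negate(x > 0))  # -x and the negated relational
```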
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118870
Approved by: https://github.com/Skylion007, https://github.com/albanD
Update all_gather to support HSDP + TP.
Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension as opposed to one. This PR updates the function to replace all non-TP dimensions with replicate to run the all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
This reverts commit a5a63db3bf937a6eff993d1222fab18cc63f9cb2.
Reverts #118368
Got reverted internally, but the branch got deleted, so the automation didn't work.
Mildly edited stack trace
```
...
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/external_utils.py", line 25, in inner
return fn(*args, **kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 635, in dispatch_trace
graph = tracer.trace(root, concrete_args)
File "torch/fx/experimental/proxy_tensor.py", line 995, in trace
res = super().trace(root, concrete_args)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/external_utils.py", line 25, in inner
return fn(*args, **kwargs)
File "torch/fx/_symbolic_trace.py", line 793, in trace
(self.create_arg(fn(*args)),),
File "torch/fx/experimental/proxy_tensor.py", line 665, in wrapped
out = f(*tensors)
File "<string>", line 1, in <lambda>
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 357, in _functionalized_f_helper
f_outs = fn(*f_args)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 68, in inner_fn
outs = fn(*args)
File "torch/_functorch/_aot_autograd/utils.py", line 161, in flat_fn
tree_out = fn(*args, **kwargs)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 618, in functional_call
out = PropagateUnbackedSymInts(mod).run(
File "torch/fx/interpreter.py", line 145, in run
self.env[node] = self.run_node(node)
File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 593, in run_node
result = super().run_node(n)
File "torch/fx/interpreter.py", line 202, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "torch/fx/interpreter.py", line 274, in call_function
return target(*args, **kwargs)
File "torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
File "torch/_subclasses/functional_tensor.py", line 380, in __torch_dispatch__
outs_unwrapped = func._op_dk(
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 744, in __torch_dispatch__
return self.inner_torch_dispatch(func, types, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 779, in inner_torch_dispatch
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 423, in proxy_call
r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
File "torch/fx/experimental/proxy_tensor.py", line 1225, in maybe_handle_decomp
return CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
File "torch/_decomp/decompositions.py", line 4322, in scaled_dot_product_flash_attention_for_cpu
torch._check(
File "torch/__init__.py", line 1133, in _check
_check_with(RuntimeError, cond, message)
File "torch/__init__.py", line 1116, in _check_with
raise error_type(message_evaluated)
RuntimeError: query must be FP32, FP64, BF16 but got torch.float16
While executing %_scaled_dot_product_flash_attention_for_cpu : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default](args = (%l_q_, %l_k_, %l_v_), kwargs = {attn_mask: %l_attn_mask_})
Original traceback:
File "executorch/backends/xnnpack/partition/graphs/sdpa.py", line 34, in forward
return torch.nn.functional.scaled_dot_product_attention(
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119204
Approved by: https://github.com/kit1980
It's inefficient to split the remaining parts of the module name by '.' just to join them back again. Instead it's more idiomatic and efficient to use `maxsplit=1` to ensure that all remaining parts stay intact. This improves best-case time and space complexity, since the scan can terminate on the first encountered `.` and only two parts are returned in a list.
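A minimal sketch of the pattern (a hypothetical helper, not the exact code touched here):
```python
# Keep the remainder of a dotted module path intact instead of splitting it
# fully and re-joining it.
def split_head(qualified_name: str):
    # Before: head, *rest = qualified_name.split("."); rest = ".".join(rest)
    head, _, rest = qualified_name.partition(".")  # or .split(".", maxsplit=1)
    return head, rest

print(split_head("encoder.layers.0.attn"))  # ('encoder', 'layers.0.attn')
```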
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119145
Approved by: https://github.com/Skylion007
Summary: As titled. Added support of fuse_split_linear_add in pregrad passes based on predispatch IR
Test Plan: TORCH_LOGS=inductor,aot buck2 run mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb
Differential Revision: D53302168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118983
Approved by: https://github.com/kflu, https://github.com/chenyang78
Fixes #114285
(However, there is still a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
@xmfan and @fegin reported that _LazyGraphModule ( https://github.com/pytorch/pytorch/pull/117911 ) makes nanogpt training fail with compiled autograd.
We have a repro: ``` python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd ```
but it's still mysterious how to trigger the issue with a toy model.
The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we will call `detect_fake_mode`. This function will look for an active FakeTensorMode from both TracingContext and example inputs. The error is triggered because we find different FakeTensorMode from these 2 sources.
Although I don't know what really causes the discrepancy in FakeTensorMode above, the fix here is to force _LazyGraphModule recompilation if compiled autograd is enabled. This does not hurt compilation time most of the time, because we will call the graph module here anyway in the backward pass when compiled autograd is enabled: 855d5f144e/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L705)
Let me know if we can have a better fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel
Summary:
The subgroup re-use logic is causing GLOO to time out on two internal modelstore tests (relevant tests in the test plan).
We are temporarily disabling subgroup re-use while root-causing, to allow the internal tests to run again, as they are currently omitted as shown in T176426987.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
UPDATE - I changed the PR because, from discussion with @jansel, it was clear that someone else was holding on to a reference to f_locals. This PR now solves that problem first. I removed the eval_frame.c part because it was failing tests that use `exec` or `eval` with a weird error like `no no locals found when storing 'math'`. I will debug that in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118447
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #118975, #118420
Summary:
Expose an option to enable libuv in the TCPStore-based rendezvous backend, which will allow better scaling.
Libuv support has been added recently and allows scaling for more than 2K nodes.
Test Plan: Unit tests
Differential Revision: D53335860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118944
Approved by: https://github.com/wconstab
Summary: NativeCachingAllocator has a global lock which shows lock contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. We can make the lock more fine-grained to reduce the contention.
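A Python sketch of the fine-grained locking idea (illustrative only; the real allocator is C++ and the exact sharding scheme is an assumption):
```python
import threading

# Shard the pointer -> block map so lookups on different shards do not
# contend on one global mutex.
class ShardedMap:
    def __init__(self, num_shards: int = 64):
        self._shards = [({}, threading.Lock()) for _ in range(num_shards)]

    def _shard(self, key):
        return self._shards[hash(key) % len(self._shards)]

    def insert(self, key, value):
        data, lock = self._shard(key)
        with lock:
            data[key] = value

    def lookup(self, key):
        data, lock = self._shard(key)
        with lock:
            return data.get(key)
```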
Test Plan: existing unit tests; verified on prod models using eight GPUs, showing double-digit improvements
Differential Revision: D52493091
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD
### Summary
Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to running them with it disabled.
All tests except `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. That test is disabled for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted feedback on whether people like making ValueRanges generic; distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream code. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
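A minimal sketch of the shape of this change (hypothetical names, not the actual `symbolic_shapes` definitions):
```python
from typing import Generic, TypeVar

import sympy
from sympy.logic.boolalg import Boolean
from typing_extensions import TypeGuard

_T = TypeVar("_T", bound=sympy.Basic)

# A generic ValueRanges whose element type can be refined with a TypeGuard.
class ValueRanges(Generic[_T]):
    def __init__(self, lower: _T, upper: _T) -> None:
        self.lower = lower
        self.upper = upper

def is_bool_range(vr: "ValueRanges[sympy.Basic]") -> TypeGuard["ValueRanges[Boolean]"]:
    # After this check, callers may treat vr as ValueRanges[Boolean].
    return isinstance(vr.lower, Boolean) and isinstance(vr.upper, Boolean)

vr = ValueRanges(sympy.Integer(0), sympy.Integer(10))
assert not is_bool_range(vr)
```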
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests is probably not going to happen. I have never seen a flaky C++ test that needs to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
These are convenience methods that create a dictionary from placeholders, making the code more compact.
Also added a `runMPSGraph` overload that takes a Placeholder instead of an output dictionary, as the majority of the operators have just one output.
A typical change looks as follows:
```patch
- NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds = @{
- selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(),
- };
- NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* results =
- @{outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData()};
- runMPSGraph(stream, cachedGraph->graph(), feeds, results);
+ auto feeds = dictionaryFromPlaceholders(selfPlaceholder);
+ runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119077
Approved by: https://github.com/kit1980, https://github.com/albanD
This is part of the work to support cross entropy in dtensor.
This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
The torch "fake" ndarray had some mismatches vs numpy.ndarray which caused test_sparse_to_sparse_compressed to fail under dynamo.
This also fixes (because the test now hits it) a problem where unpacking a sequence with the incorrect number of args would assert in dynamo instead of graph breaking (because it would throw an exception). Added a unit test for this condition.
Fixed:
- torch._numpy._ndarray.astype() (actually used by the test)
- torch._numpy._ndarray.put() (drive-by discovery)
- torch._numpy._ndarray.view() (drive-by discovery)
(burndown item 7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117952
Approved by: https://github.com/yanboliang
ghstack dependencies: #117951
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support the BSC format, the test previously just skipped it. Instead, fake it by using a transposed BSR format.
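A minimal sketch of the idea, assuming scipy's `bsr_matrix` as the numpy-side reference (the exact test code may differ):
```python
import numpy as np
import torch
from scipy.sparse import bsr_matrix

# scipy has no BSC class, so build the reference for torch's BSC conversion
# from a BSR matrix of the transposed input
dense = np.arange(16, dtype=np.float32).reshape(4, 4)
bsc = torch.from_numpy(dense).to_sparse_bsc(blocksize=(2, 2))
bsr_of_t = bsr_matrix(dense.T, blocksize=(2, 2))

# the column-compressed structure of BSC matches the row-compressed
# structure of BSR built from the transpose
assert np.array_equal(bsc.ccol_indices().numpy(), bsr_of_t.indptr)
assert np.array_equal(bsc.row_indices().numpy(), bsr_of_t.indices)
```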
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
Summary:
Error running llama in xplat, where the Half type isn't part of c10_mobile targets. See: D53158320
This diff:
- Creates a `torch_mobile_all_ops_et` target, which is the same as `torch_mobile_all_ops`, except with a preprocessor flag (C10_MOBILE_HALF) to support Half type
- Check C10_MOBILE_HALF in LinearAlgebra.cpp and include it
- Use `torch_mobile_all_ops_et` for executorch, instead of `torch_mobile_all_ops`.
Considerations:
- Using `torch_mobile_all_ops_et` across executorch means that our runtime binary size for xplat aten increases (see test plan for increase amount, thanks tarun292 for the pointer). This may be okay, as aten mode isn't used in production.
Test Plan:
Run language llama in xplat:
```
buck2 run xplat/executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
And in fbcode:
```
buck2 run fbcode//executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
Test executor_runner size increase with:
```
buck2 build fbcode//executorch/sdk/fb/runners:executor_runner_aten
```
| | original | this diff (+half dtype) | diff |
|---|---|---|---|
| unstripped | 214975784 | 214976472 | +688 |
| stripped | 71373488 | 71373808 | +320 |
Differential Revision: D53292674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118831
Approved by: https://github.com/larryliu0820
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]`. This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now (a minimal sketch follows this list).
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with a single input/output.
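A minimal Python sketch of this fallback (illustrative only; the actual change is in the C++ gloo process group):
```python
import torch
import torch.distributed as dist

# Emulate reduce-scatter via an all-reduce followed by keeping this rank's chunk.
def reduce_scatter_via_allreduce(inp: torch.Tensor, group=None) -> torch.Tensor:
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    buf = inp.clone()
    dist.all_reduce(buf, group=group)           # 2x the traffic of a native reduce-scatter
    return buf.chunk(world_size)[rank].clone()  # this rank's shard of the reduced tensor
```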
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
Summary:
We were suspecting ncclCommsAbort was hung due to a NCCL 2.17 'bug' triggered by different ranks calling the destructors of different PGs in different orders. This can be reproed in a NCCL-level test for 2.17.
We need a test case in c10d to constantly check that PGs can be destructed in different orders.
Test Plan:
Run the test and verify that the printed destruction orders are as expected:
```
$ python test/distributed/test_c10d_nccl.py
ProcessGroupNCCLTest.test_close_multi_pg_unordered
NCCL version 2.19.3+cuda12.0
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 0] ProcessGroupNCCL
abort finished.
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 1] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 0] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 1] ProcessGroupNCCL
abort finished.
.
----------------------------------------------------------------------
Ran 1 test in 18.969s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119045
Approved by: https://github.com/yifuwang
Fixes#112389 and https://github.com/facebookincubator/dynolog/issues/208
This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using a CPU-only build of PyTorch.
* When CUDA is available, the profiler is initialized on first CUDA stream creation (or lazily when the profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog when we set the "KINETO_USE_DAEMON" env variable.
Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.
## Test Plan:
Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)
For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog -- --enable_ipc_monitor &
```
Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py), and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```
We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
/tmp/test_1807792.json
```
And the trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
`auto_functionalize` currently takes a custom op, a list of mutated argument names, and inputs to the custom op as kwargs. The list of mutated argument names is computed from the schema, and gets created when we're tracing. However, it seems that having the list of mutated argument names is a little unnecessary since we can always recompute it from the schema during runtime.
This also prevents the case where users might incorrectly modify the inputs to this operator, as we will now just recompute the list at runtime. This probably won't affect things too much because inductor will decompose auto_functionalize.
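As an illustration of the runtime recomputation, a minimal sketch (not the actual auto_functionalize code) that derives the mutated argument names from an op's schema:
```python
import torch

def mutated_arg_names(op: torch._ops.OpOverload) -> list:
    # an argument is mutated if its schema alias info marks it as written to
    return [
        arg.name
        for arg in op._schema.arguments
        if arg.alias_info is not None and arg.alias_info.is_write
    ]

print(mutated_arg_names(torch.ops.aten.add_.Tensor))  # ['self']
```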
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119050
Approved by: https://github.com/zou3519
The problem was exposed in https://github.com/pytorch/pytorch/pull/118071, where the control flow tests were always recompiling. The issue turned out to be that the same nonlocal variable used in `true_fn` and `false_fn` was getting lifted twice, thus creating two inputs in the main Fx graph. Dynamo's tensor guards do not like this because they want all input tensors to be non-aliased.
We already have logic to check whether two different sources (the closure of true_fn and the closure of false_fn) point to the same tensor using the side-effects infra. But we were restoring side_effects after subtracing the true and false branches. This is not needed anymore: side_effects traces both read-only accesses and actual writes to the variables. For higher-order ops, any mutation which is not read-only leads to a graph break and safely exits tracing. For read-only side effects, it doesn't matter.
This PR removes the restoring of side_effects, which turns on the logic for checking if two different sources point to the same tensor, and thus lifts the common non local tensor to just once in the main graph.
Related discussion at https://github.com/pytorch/pytorch/issues/113235
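For context, a hypothetical minimal repro of the pattern described above (the same nonlocal tensor closed over by both branches), using the experimental control-flow `cond` API; the names and shapes here are illustrative, not taken from the actual tests:
```python
import torch
from functorch.experimental.control_flow import cond

# the same nonlocal tensor is closed over by both branches, so it should be
# lifted into the main graph only once
shared = torch.randn(3)

@torch.compile(backend="eager", fullgraph=True)
def f(pred, x):
    def true_fn(x):
        return x + shared

    def false_fn(x):
        return x - shared

    return cond(pred, true_fn, false_fn, [x])

f(torch.tensor(True), torch.randn(3))
```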
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118420
Approved by: https://github.com/ydwu4, https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #118975
Partially fixes https://github.com/pytorch/pytorch/issues/105077
Repro:
```python
import tempfile
import torch
from torch._subclasses import fake_tensor
class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```
Error:
```bash
Traceback (most recent call last):
File "issue_gh_torch_105077.py", line 22, in <module>
state_dict = torch.load(state_dict_file.name)
File "/opt/pytorch/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/opt/pytorch/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
_, new_kwargs = normalize_function(
File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
torch_op_schemas = get_signature_for_torch_op(target)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
arg_type = _torchscript_type_to_python_type(arg.type)
File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
return eval(ts_type.annotation_str, _type_eval_globals)
File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```
This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in `torch.utils._mode_utils.no_dispatch()` to skip the fake-mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
Previously we were downloading all of (eager311, dynamo38, dynamo311).
Now we just download what's necessary. This is useful for
update_failures.py because the dynamo tests finish much faster than the
eager tests and it only needs the result from the dynamo tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119027
Approved by: https://github.com/jamesjwu
ghstack dependencies: #118874, #118882, #118931
Previously, you could run update_failures.py (with a commit hash) and it
would add new expected failures and skips for newly failing tests and
remove expected failures for newly passing tests.
This PR teaches update_failures.py to also remove skips for tests that
are now passing without them.
The way we do this is:
- dynamo_test_failures.py doesn't actually skip tests -- it runs the
test and then suppresses the signal.
- if the test actually passed, then the test gets skipped with a special
skip message
- we teach update_failures.py to look for the presence of that skip
message.
Test Plan:
- Used this to generate https://github.com/pytorch/pytorch/pull/118928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118931
Approved by: https://github.com/yanboliang
ghstack dependencies: #118874, #118882
Fixes https://github.com/pytorch/pytorch/issues/111020
For the following code:
```python
import torch
import torch._higher_order_ops.wrap
glob = []
def f(x):
    glob.append(x)
    return x.clone()

@torch.compile(backend='eager', fullgraph=True)
def g(x):
    return torch.ops.higher_order.wrap(f, x)
x = torch.randn(3)
g(x)
```
The stacktrace now becomes:
```
[2024-02-01 15:23:34,691] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting wrap, we were unable to trace function `f` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Traceback (most recent call last):
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] output = f.call_function(tx, args, sub_kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return super().call_function(tx, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return tx.inline_user_function_return(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return cls.inline_call_(parent, func, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] tracer.run()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] and self.step()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] getattr(self, inst.opname)(inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return inner_fn(self, inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.call_function(fn, args, {})
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.push(fn.call_function(self, args, kwargs))
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return self.obj.call_method(tx, self.name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] return super().call_method(tx, name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] tx.output.side_effects.mutation(self)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] self.check_allowed_side_effect(var)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] unimplemented(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] raise Unsupported(msg)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test.py", line 219, in <module>
g(x)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
return _compile(
File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 496, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2125, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1227, in call_function
p_args, p_kwargs, example_value, body_r, treespec, _ = self.create_wrapped_node(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1190, in create_wrapped_node
) = speculate_subgraph(
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 453, in speculate_subgraph
raise ex
File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
output = f.call_function(tx, args, sub_kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
return super().call_function(tx, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
return self.obj.call_method(tx, self.name, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
return super().call_method(tx, name, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
tx.output.side_effects.mutation(self)
File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
self.check_allowed_side_effect(var)
File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
unimplemented(
File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
from user code:
File "/home/yidi/local/pytorch/test.py", line 216, in g
return torch.ops.higher_order.wrap(f, x)
File "/home/yidi/local/pytorch/test.py", line 211, in f
glob.append(x)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118826
Approved by: https://github.com/yanboliang, https://github.com/zou3519
Summary:
X-link: https://github.com/pytorch/executorch/pull/1817
Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
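As a small illustration of what a non-persistent buffer is (a generic example, not taken from this PR's tests):
```python
import torch

# the buffer participates in computation but is absent from the state dict
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(1), persistent=False)

    def forward(self, x):
        return x * self.scale

m = M()
assert "scale" not in m.state_dict()           # not persisted
assert "scale" in dict(m.named_buffers())      # but still a buffer
ep = torch.export.export(m, (torch.randn(2),))  # the case this PR makes work
```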
One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.
This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.
As a side effect, this diff tightened up quite a few sloppy behaviors around state dict handling:
- Tensor attributes were getting promoted to be buffers—bad!
- Tracing through a module not in the children of the root module would add its parameters/buffers to the state dict—bad!
This behavior is unlikely to show up in user code since the model would be totally broken, but did show up in a bunch of tests.
#buildmore
Test Plan:
unit tests
sandcastle
Differential Revision: D53340041
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118969
Approved by: https://github.com/guangy10, https://github.com/huydhn, https://github.com/titaiwangms
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a non-tensor step was incorrect for the capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed (see the sketch below).
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, after this PR my preference has only been further cemented that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.
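To make the first bug in the list above concrete, a minimal sketch of the scenario (it assumes a CUDA device and the fixed behavior; the model and hyperparameters are placeholders):
```python
import torch

# an old-style state dict whose "step" entries are plain Python numbers,
# loaded into a capturable optimizer, must end up with step tensors on CUDA
model = torch.nn.Linear(2, 2, device="cuda")
opt = torch.optim.Adam(model.parameters(), capturable=True)
model(torch.randn(4, 2, device="cuda")).sum().backward()
opt.step()

sd = opt.state_dict()
for state in sd["state"].values():
    state["step"] = float(state["step"])  # simulate a non-tensor step

opt.load_state_dict(sd)
assert all(s["step"].is_cuda for s in opt.state.values())
```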
Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs that get run shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s
OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
This PR adds the pre- and post-backward logic:
- **Pre-backward hook:** `FSDPState` and `FSDPParamGroup` define this, and `FSDPState` is responsible for registering since its pre-backward should run even if the `FSDPState` does not manage any parameters (in case it is the root).
- **Post-backward hook:** Only `FSDPParamGroup` defines this since the post-backward hook reshards parameters and reduce-scatters gradients (functionality only needed with managed parameters). The `FSDPParamGroup` is responsible for registering this.
- **Post-backward final callback:** `FSDPState` defines this, and each `FSDPParamGroup` defines a `finalize_backward()` to call in the final callback.
### Pre-Backward
The pre-backward hook is registered on the module outputs (that require gradient), and it should run when the first such output has its gradient computed. The hook may run multiple times per backward, once per module forward. Specifically, there will be one `(pre-backward, post-backward)` interval for each of the module's `forward()` calls. This is in contrast with the existing FSDP semantics, which only define a single `(pre-backward, post-backward)` interval that is equivalent to the union of this FSDP's `(pre-backward, post-backward)` intervals. This avoids spiking memory from having multiple modules not resharding and avoids some autograd edge cases.
We implement the pre-backward hook by having a flag that is set upon the first call to disable subsequent calls. This flag could be maintained by FSDP, but for a cleaner design, we augment `register_multi_grad_hook` with a `mode="any"` option and use that instead.
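A minimal sketch of the "fire once, on the first gradient" behavior, assuming the `mode="any"` option this stack adds to `register_multi_grad_hook` (the tensors and hook body are illustrative):
```python
import torch
from torch.autograd.graph import register_multi_grad_hook

x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
out1, out2 = x * 2, y * 3

def pre_backward_hook(grad):
    # with mode="any", the hook runs once, when the first of (out1, out2)
    # receives its gradient
    print("pre-backward fired")

handle = register_multi_grad_hook((out1, out2), pre_backward_hook, mode="any")
(out1.sum() + out2.sum()).backward()
handle.remove()
```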
### Post-Backward
The post-backward hook is equivalent to a module full backward hook (`nn.Module.register_full_backward_hook`) except it adds pytree logic to work with data structures other than just flat `Tensor` args passed to `nn.Module.forward`. If we were to use `register_full_backward_hook`, then the hook could fire early (before all gradients for the module have been computed). Most internal models use custom data structures as `forward` inputs, and they find that unifying under pytree is an acceptable solution.
Unlike existing FSDP, we are able to reshard the parameters in the post-backward hook _before_ 'concatenating' the autograd-computed gradients, achieving a lower peak memory usage. (Existing FSDP has `SplitWithSizesBackward` that calls a `CatArrayBatched`, and here we have the reduce-scatter copy-in.)
### Final Callback
The final callback runs as a queued callback to the autograd engine, meaning that it runs at the end of backward.
In the future, if we do not want to wait for the reduce-scatter (or similar for CPU offloading), we can augment the final callback. The code is written such that each reduce-scatter can be waited on separately (via CUDA event).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118004
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117950, #117955, #117973, #117975
Fix internal failure D53291154
From alban: the change is breaking because the alpha argument is now kwarg-only (via the `*` marker), while it was OK for it to be positional before for the rsub.Scalar overload.
```
_wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 453, in _fn
return fn(*args, **kwargs)
File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "torch/_dynamo/eval_frame.py", line 615, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
return _compile(
File "python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "torch/_dynamo/convert_frame.py", line 650, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "torch/_dynamo/utils.py", line 248, in time_wrapper
r = func(*args, **kwargs)
File "torch/_dynamo/convert_frame.py", line 531, in compile_inner
out_code = transform_code_object(code, transform)
File "torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
transformations(instructions, code_options)
File "torch/_dynamo/convert_frame.py", line 155, in _fn
return fn(*args, **kwargs)
File "torch/_dynamo/convert_frame.py", line 496, in transform
tracer.run()
File "torch/_dynamo/symbolic_convert.py", line 2125, in run
super().run()
File "torch/_dynamo/symbolic_convert.py", line 787, in run
and self.step()
File "torch/_dynamo/symbolic_convert.py", line 750, in step
getattr(self, inst.opname)(inst)
File "torch/_dynamo/symbolic_convert.py", line 469, in wrapper
return inner_fn(self, inst)
File "torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
self.call_function(fn, args, kwargs)
File "torch/_dynamo/symbolic_convert.py", line 651, in call_function
self.push(fn.call_function(self, args, kwargs))
File "torch/_dynamo/variables/torch.py", line 614, in call_function
tensor_variable = wrap_fx_proxy(
File "torch/_dynamo/variables/builder.py", line 1285, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "torch/_dynamo/variables/builder.py", line 1370, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "torch/_dynamo/utils.py", line 1653, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "torch/_dynamo/utils.py", line 1599, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1140, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1600, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1720, in run_node
raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
File "torch/_dynamo/utils.py", line 1699, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 1637, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1975, in dispatch
return self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 2190, in _dispatch_impl
r = func(*args, **kwargs)
File "torch/_ops.py", line 571, in __call__
return self_._op(*args, **kwargs)
File "torch/_prims_common/wrappers.py", line 252, in _fn
result = fn(*args, **kwargs)
```
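A minimal illustration of the two call styles at issue (hypothetical values; the `try` guards the positional form whose acceptance is what changed):
```python
import torch

x = torch.randn(4)

torch.rsub(x, 2, alpha=3)  # keyword alpha: always accepted

try:
    torch.rsub(x, 2, 3)    # positional alpha on the rsub.Scalar overload
except TypeError as e:
    print("alpha is keyword-only here:", e)
```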
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118907
Approved by: https://github.com/lezcano
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.
### This PR
This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
Summary: This is the equivalent API to `model.train()` for
exported models, analogous to `move_exported_model_to_eval`.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_inplace
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_bn
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113492
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan
Summary: generate_index_put_fallback currently generates something like the following,
```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```
The problem is wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, and when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements become invalid, causing a segfault.
Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
Fixes #[117504](https://github.com/pytorch/pytorch/issues/117504)
Re-engineering Hipify Trie:
(1) Re-engineering Trie.
(2) More documentation or comments for easier understanding
(3) Created a set of unit test (class `TestHipifyTrie`) to test the Trie data structure and APIs.
Test:
```
root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1
collected 11453 items / 11445 deselected / 8 selected
Running 8 items in this shard
test/test_utils.py ........ [100%]
============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================
root@xxx:/development/pytorch#
```
Also diffed the contents generated by this tool using the original code and the new code of the hipify_python.py script, and verified there is no difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
Make variables in dict lazy and remove DICT_KEYS guard.
We build the keys of a dict depth-first and we rely on the guards of
each element in the dict to create the correct guards. This allows us to
remove the rather buggy DICT_KEYS guard and make the guard lazy.
The guards are not completely lazy yet, as we instantiate them in
`_HashableTracker._eq_impl` but it should be possible to make them
truly lazy.
Also, adding new types to the supported types within keys should be less
error prone.
This is marginally less efficient when we graph break, but in turn we
should graph break much less. It also makes the dicts code easier to maintain
(removes `is_hashable_python_var`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117625
Approved by: https://github.com/jansel, https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983
This enables the new way of writing guards for dicts. Before we were
doing things like
```
L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)][3] is L['self'].param_groups[0]['params'][3]
```
without knowing whether `L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)]` was a list.
On a different note, I'll probably write a pass to recover the previous
way to place guards on dicts via something like `DICT_KEYS` as an
optimisation, as it seems relevant for optimisers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117983
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.
# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
Improvements to shape padding logic in torch/_inductor/pad_mm.py
These changes could lead to up to a 14% perf improvement for certain Meta-internal models in experiments.
Most notably:
* 1.) Use the aten.const_pad_nd operation to pad Tensors in a single op instead of using multiple steps involving intermediate buffers. This appears to be more performant than the previous logic, confirmed by profiling and benchmarking results (Meta internal); see the sketch after this list.
* 2.) Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
* 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
* Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
* Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.
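As a rough sketch of the single-op padding from the first bullet, using `torch.constant_pad_nd` (the Python binding for that aten op); the helper and the alignment value are hypothetical, not the pad_mm implementation:
```python
import torch

def pad_dim(t: torch.Tensor, dim: int, align: int) -> torch.Tensor:
    pad = (-t.size(dim)) % align
    if pad == 0:
        return t
    # constant_pad_nd takes (lo, hi) pairs starting from the last dimension.
    pads = [0] * (2 * t.ndim)
    pads[2 * (t.ndim - 1 - dim) + 1] = pad
    return torch.constant_pad_nd(t, pads)

a, b = torch.randn(127, 64), torch.randn(64, 127)
out = (pad_dim(a, 0, 8) @ pad_dim(b, 1, 8))[:127, :127]  # slice back after the GEMM
```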
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522
Approved by: https://github.com/jansel, https://github.com/eellison
When I built PyTorch on Windows with the latest MKL, it reported:
```
sources\pytorch\aten\src\ATen/cpu/vml.h(106): error C2338: static_assert failed: 'MKL_INT is assumed to be int32_t'
```
It should be safe to relax the restriction to int64_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118946
Approved by: https://github.com/ezyang
- Moved the dictionary arguments to the node's kwargs as dicts are not
valid inputs.
- Inlined the mutated arguments into the output. Originally, the output of auto_functionalize was the operator output and a list of mutated arguments (e.g. `[op_out1, op_out2, [mutated_arg1, mutated_arg2]]`). However, this is not easily exportable. Now, it will just be `[op_out1, op_out2, mutated_arg1, mutated_arg2]`.
Differential Revision: [D53331040](https://our.internmc.facebook.com/intern/diff/D53331040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118809
Approved by: https://github.com/zou3519
Summary:
Whenever we access a constant, we emit a `get_attr` node for it.
The `lift_constants_pass` was lifting every `get_attr` node unconditionally, even if the same target was already lifted. This diff fixes that.
I also took the liberty of adding some infra to make it easier to unit test passes. GraphBuilder lets you declaratively construct graphs with the right metadata, it's pretty useful for directly inducing the pattern you want to test against.
Test Plan: added unit test
Differential Revision: D53278161
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118776
Approved by: https://github.com/angelayi, https://github.com/titaiwangms
Summary:
Add a defined traverse mode for the minimizer: it takes user input start_idx and end_idx, forms a subgraph, and compares results from accelerators vs. CPU.
Differential Revision: D53318292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118889
Approved by: https://github.com/jfix71
Removes raising error if a device_mesh has a parent.
The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.
I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)
### Why?
Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If at some later point we run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.
To avoid this, we check for a `0` sized grid whenever there's symbolic shapes which includes backed and unbacked symints.
This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model.
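Conceptually, the generated guard amounts to something like the following (a simplified sketch, not the actual Inductor codegen):
```python
import math

def launch_if_nonzero_grid(grid, launch_kernel):
    # Skip the launch when any grid dimension evaluates to zero, which would
    # otherwise surface as "CUDA driver error: invalid argument".
    if all(g > 0 for g in grid):
        launch_kernel(grid)

s1 = 0  # a backed symint that happens to be zero at runtime
launch_if_nonzero_grid((math.ceil(s1 / 512), 1, 1), lambda g: print("launch", g))
```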
### Test
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols
# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)
# Now
OK (skipped=3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly my preference after this PR has only been further cemented that we should just do the single tensor and multi tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.
Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the generated configs shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s
OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows that this is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
Summary: Vulkan Linear op doesn't support 1d tensors. We can unsqueeze 1d tensors to 2d to unblock the functionality.
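In eager Python terms, the unsqueeze/squeeze trick looks roughly like this (a sketch of the shape handling only, not the Vulkan shader path):
```python
import torch
import torch.nn.functional as F

def linear_any_dim(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    was_1d = x.dim() == 1
    if was_1d:
        x = x.unsqueeze(0)           # (in_features,) -> (1, in_features)
    out = F.linear(x, weight, bias)  # 2d path
    return out.squeeze(0) if was_1d else out

w, b = torch.randn(5, 3), torch.randn(5)
assert linear_any_dim(torch.randn(3), w, b).shape == (5,)
```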
Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*linear_*"`
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *linear_*
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from VulkanAPITest
[ RUN ] VulkanAPITest.linear_1d_small
[ OK ] VulkanAPITest.linear_1d_small (319 ms)
[ RUN ] VulkanAPITest.linear_1d_large
[ OK ] VulkanAPITest.linear_1d_large (64 ms)
[ RUN ] VulkanAPITest.linear_2d_flat
[ OK ] VulkanAPITest.linear_2d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_2d_small
[ OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN ] VulkanAPITest.linear_2d_large
[ OK ] VulkanAPITest.linear_2d_large (129 ms)
[ RUN ] VulkanAPITest.linear_3d_flat
[ OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_3d_small
[ OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN ] VulkanAPITest.linear_3d_large
[ OK ] VulkanAPITest.linear_3d_large (51 ms)
[ RUN ] VulkanAPITest.linear_4d_flat
[ OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN ] VulkanAPITest.linear_4d_small
[ OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN ] VulkanAPITest.linear_4d_large
[ OK ] VulkanAPITest.linear_4d_large (6 ms)
[----------] 11 tests from VulkanAPITest (578 ms total)
[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (578 ms total)
[ PASSED ] 11 tests.
```
Differential Revision: D53243201
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118690
Approved by: https://github.com/jorgep31415, https://github.com/liuk22
## Context
This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).
This changeset builds upon the [previous PR enabling resource aliasing](https://github.com/pytorch/pytorch/pull/118436) and introduces the `SharedObject` class to `ComputeGraph`, which manages resource aliasing in graph execution mode. `SharedObject` tracks which `vTensor` values in a `ComputeGraph` share the same backing memory, and provides functionality to aggregate memory requirements and bind users to the same memory allocation.
## Notes for Reviewers
The `SharedObject` class is introduced in `Graph.h`. It's fairly simple and provides three functions:
* `add_user()` which adds a `ValueRef` to the list of users of the `SharedObject`, and updates the aggregate memory requirements with the memory requirements of the new user
* `allocate_memory()` creates a `VmaAllocation` with the aggregated memory requirements
* `bind_users()` iterates over the `users` of the `SharedObject` and binds each `vTensor`'s underlying resource to the memory associated with the `SharedObject`.
As for how `SharedObject` is used in `ComputeGraph`:
* `add_tensor()` now has an additional argument `shared_object_idx` which, if `>0`, will construct a `vTensor` without any backing memory and add the new `vTensor` to the `SharedObject` at `shared_object_idx`
* `encode_execute()` will first iterate through the `SharedObject`s of the graph and allocate + bind users before recording the command buffer.
Differential Revision: [D53271486](https://our.internmc.facebook.com/intern/diff/D53271486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118756
Approved by: https://github.com/jorgep31415, https://github.com/yipjustin
The motivation is fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.
How did I decide whether or not to move helper functions? It was kind of random, but if they weren't used in fake tensor generally, I moved them over.
There are no functional code changes, so you only need to review the import changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.
Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.
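For example, when a kernel only reads a slice of its input, only the slice's bytes should count toward the traffic (a simplified illustration with made-up sizes):
```python
import torch

x = torch.randn(1024, 1024)
sliced = x[:, :128]  # what the kernel actually reads

correct_bytes = sliced.numel() * sliced.element_size()
naive_bytes = x.numel() * x.element_size()  # overcounts the traffic by 8x here
print(correct_bytes, naive_bytes)
```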
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work with each other at the beginning.
Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does 2-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly, and the multi-kernel Python code is not generated since it should not be needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
```
def f():
def g():
return ()
print(g.__name__)
f()
```
The script above should print `g` (with or without torch.compile),
but prints `f.<locals>.g` with torch.compile.
The problem looks like we use the co_qualname when reconstructing the
NestedUserFunctionVariable. I switched this over to use the co_name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118768
Approved by: https://github.com/yanboliang, https://github.com/jansel
Summary:
Previously, get_attr nodes were skipped, but consider for example:
%mul_240 : [num_users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.mul](args = (), kwargs = {input: %_fx_const_folded_attrs_13, other: %add_143})
where %_fx_const_folded_attrs_13 is int64 but %add_143 is float, which causes issues if the get_attr node is skipped, e.g. "unsupported dtype='int64' for alignments"
Differential Revision: D53273467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118760
Approved by: https://github.com/khabinov
This PR adds the FSDP reduce-scatter (the copy-in/reduce-scatter collective/view-out).
- We use gradient pre- and post-divide factors like existing FSDP (mainly for fp16 reduction).
- We use a separate CUDA stream for the reduce-scatter to conveniently handle additional kernels surrounding the collective as a separate 'thread of execution' (e.g. pre/post-divide and later the D2H gradient offload).
- ~~The implementation in this PR is more complicated to _try_ to reduce CPU overhead by using `torch.split` instead of a Python for-loop. The challenge comes from the fact that the autograd-computed unsharded gradients do not have padding. We prefer to not do an intermediate padding step and instead directly copying to the big reduce-scatter input.~~ For simplicity, I changed the implementation to include intermediate padding steps, as it can still achieve ~250 GB/s, and it avoids any `O(NP)` tensor materialization for world size `N` and `P` `nn.Parameter`s.
<details>
<summary> Recall: Copy-in/All-Gather/Copy-Out Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
<details>
<summary> Copy-in/Reduce-Scatter/View-Out Example </summary>
Suppose we have 2 gradients with shapes `(3, 3)` (denoted with `a`s when not-yet-reduced and `A`s after reduced) and `(2, 2)` (denoted with `b`s and `B`s similarly) and 2 ranks, where `E` represents empty:
```
Given from autograd:
(3, 3): aaaaaaaaa
(2, 2): bbbb
Unsharded gradients/reduce-scatter inputs (no padding!):
Rank 0: aaaaaaaaa, bbbb
Rank 1: aaaaaaaaa, bbbb
Each rank allocate group's reduce-scatter input:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: aaaaaabbaaaEEEbb
Rank 1: aaaaaabbaaaEEEbb
Each rank reduce-scatter:
Rank 0: AAAAAABBAAAEEEBB
Rank 1: AAAAAABBAAAEEEBB
Each rank view-out:
Rank 0: AAAAAA, BB
Rank 1: AAA, BB
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117975
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955, #117973
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up since the only two scan
ops with lowerings use "normal" values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
While reducing the input range, we do not fully remove the clipping effect; that's why the threshold is 5 and not around 1.
- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values
More details on the bug:
For uint8 input and antialiasing=False we are using a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of the `index_size` are zeros. For example, for an output point we can have index_min=10 and index_size=4 and 4 non-zero weights: so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries we should clamp `index_min` between 0 and input_size and `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but is not correct for antialiasing=False where weights are computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]
separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: weight value corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72 and weight value corresponding to source index 1 is -0.1080000000000001 is the same as in f32 approach.
```
Quick benchmark to ensure perfs no regression:
```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
| torch (2.3.0a0+gitfda85a6) PR | torch (2.3.0a0+git0d1e705) Nightly | Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False | 440.996 (+-2.044) | 470.824 (+-5.927) | 1.068 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False | 463.565 (+-1.519) | 497.231 (+-10.825) | 1.073 (+-0.000)
3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False | 1717.000 (+-28.589) | 1915.570 (+-43.397) | 1.116 (+-0.000)
3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False | 1801.954 (+-22.391) | 1981.501 (+-37.034) | 1.100 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False | 199.599 (+-0.851) | 196.535 (+-3.788) | 0.985 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False | 243.126 (+-0.681) | 240.695 (+-2.306) | 0.990 (+-0.000)
3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False | 686.270 (+-2.870) | 687.769 (+-17.863) | 1.002 (+-0.000)
3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False | 899.509 (+-5.377) | 899.063 (+-9.001) | 1.000 (+-0.000)
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.
# Design
This PR primarily offers device-related APIs in the Python frontend (see the usage sketch after this list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
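A minimal usage sketch of the APIs listed above (assuming a build with XPU support and at least one XPU device present):
```python
import torch

if torch.xpu._is_compiled() and torch.xpu.is_available():
    print("device count:", torch.xpu.device_count())
    torch.xpu.set_device(0)
    print("current device:", torch.xpu.current_device())
    print("name:", torch.xpu.get_device_name(0))
    print("capability:", torch.xpu.get_device_capability(0))
    with torch.xpu.device(0):
        print(torch.xpu.get_device_properties(0))
```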
# Additional Context
We will implement the support of lazy initialization in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.
This serializes in a different format than TorchScript does, which uses the TS C++ pickler.
The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.
Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```
Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.
Reviewed By: SherlockNoMad
Differential Revision: D53248454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
Description:
- Lowered error thresholds and added an input range for bicubic to make visible the inconsistency error in the implementation of upsampling (smaller -> larger) bicubic aa=false mode for uint8 input dtype
- Updated out-dated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
Summary: We only need to deepcopy the graph because we're modifying the graph by unlifting its parameter/buffer inputs. We don't need to deepcopy the graph module state/contents. This causes an error when the graph module contains an ExecuTorch LoweredModule which stores tensors.
Test Plan: Fixes the following diff
Differential Revision: D53290077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118821
Approved by: https://github.com/tugsbayasgalan
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.
Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
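A minimal usage sketch of the op under test, called through its aten handle (the `*_copy` variant returns freshly allocated tensors rather than views):
```python
import torch

x = torch.arange(10)
chunks = torch.ops.aten.split_with_sizes_copy(x, [2, 3, 5])
print([c.tolist() for c in chunks])  # [[0, 1], [2, 3, 4], [5, 6, 7, 8, 9]]
```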
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). Compared with a normal lowering, the main module takes in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
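A toy illustration of the first step (identifying the input-independent subgraph) in torch.fx terms; this is only a sketch of the idea with a made-up module, not the AOTInductor pass:
```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        folded = self.w.t() * 2.0  # depends only on constants: foldable once
        return x @ folded          # depends on the runtime input: main module

gm = fx.symbolic_trace(M())

# A node belongs to the "main" module if any transitive input is a placeholder.
depends_on_input = set()
for node in gm.graph.nodes:
    if node.op == "placeholder" or any(a in depends_on_input for a in node.all_input_nodes):
        depends_on_input.add(node)

foldable = [n for n in gm.graph.nodes
            if n.op not in ("placeholder", "get_attr", "output") and n not in depends_on_input]
print(foldable)  # the t/mul chain feeding the matmul
```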
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Summary:
When `version` is missing in the metadata, use `min_val/max_val` as keys instead of `max_vals/min_vals`
## Reasons
1. It's been almost 2 years since this change D30003700, which means now most checkpoints are using the `max_val/min_val` keys
2. Most checkpoint dumps using `model.state_dict()` don't have version info, which will lead to a fake `missing keys` error when loading state_dict
Test Plan: CI
Differential Revision: D53233012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118659
Approved by: https://github.com/jerryzh168
This PR adds the all-gather and free logic required for forward.
- We define the logical all-gather as two ops: (1) unshard and (2) wait for unshard. This abstraction allows capturing both implicit forward prefetching (using multiple streams and `async_op=False`) and explicit forward prefetching (using `async_op=True`).
- Symmetrically, we define the reshard op to free the unsharded parameters.
Some other notes:
- The `FSDPParamGroup` and its `FSDPParam`s transition their sharded states together. This invariant allows us to reason about the parameters by group rather than individually with respect to whether they are sharded or unsharded.
---
### How Does the Overlap Work for All-Gather?
For context, the all-gather consists of three steps: (1) copy-in, (2) all-gather collective, and (3) copy-out.
<details>
<summary> Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
`dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream before running the collective. `async_op=True` means that the function waits on the work, having the current stream wait for the NCCL stream before returning. `async_op=False` means it returns the `Work` object, which the user can wait on later.
#### Implicit Prefetching
Implicit prefetching achieves communication/computation overlap without changing the CPU issue order:
- We use separate streams for copy-in and for issuing the `dist.all_gather_into_tensor()`. The copy-in stream allows us to overlap the copy-in with all-gather/reduce-scatter in backward, and the all-gather stream allows us to overlap the all-gather with forward compute (issued before it).
- Because `dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream, we need this "dummy" all-gather stream to prevent the all-gather from waiting on the forward compute with which it should overlap.
- Without the separate copy-in stream, we cannot overlap all-gather copy-in with all-gather in forward.
- We copy-out in the default stream after having the default stream wait for the all-gather. This means that the autograd leaves are allocated in the default stream and autograd will not call `recordStream`.
Implicit prefetching does not require knowing the execution order ahead of time. However, when overlapping the next all-gather with the current compute, there may be a gap from the CPU thread issuing the current compute. If the CPU thread can run ahead, then this is not an issue.
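A rough sketch of this stream pattern (a hypothetical helper that assumes CUDA and an already-initialized NCCL process group, and elides padding and the copy-out details):
```python
import torch
import torch.distributed as dist

def implicit_unshard(sharded_flat: torch.Tensor, group,
                     copy_in_stream: torch.cuda.Stream,
                     all_gather_stream: torch.cuda.Stream) -> torch.Tensor:
    world_size = dist.get_world_size(group)
    with torch.cuda.stream(copy_in_stream):
        # Copy-in runs in its own stream so it can overlap with collectives in backward.
        all_gather_input = sharded_flat.contiguous()
        all_gather_output = sharded_flat.new_empty(world_size * sharded_flat.numel())
    all_gather_stream.wait_stream(copy_in_stream)
    with torch.cuda.stream(all_gather_stream):
        # Issuing from the "dummy" all-gather stream keeps the NCCL stream from
        # waiting on forward compute queued in the default stream.
        dist.all_gather_into_tensor(all_gather_output, all_gather_input, group=group)
    # Copy-out then happens in the default stream after it waits for the all-gather.
    torch.cuda.current_stream().wait_stream(all_gather_stream)
    return all_gather_output
```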
#### Explicit Prefetching
Explicit prefetching achieves communication/computation overlap by changing the CPU issue order, namely by reordering the all-gather to be before the compute with which it should overlap.
- Because we reorder, we do not need any separate streams, and we can use `async_op=False` for overlap.
- We can expose this explicit prefetching as a module-level `unshard()` op (e.g. `module.unshard(async_op: bool)`), and we can use it as a primitive for implementing the explicit forward prefetching in existing FSDP.
Explicit prefetching requires knowing the execution order.
---
Disclaimer: The testing is relatively lighter in this PR. I did not want to spend too much time writing new forward-only tests. The stream usage will be exercised thoroughly once we have backward too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117973
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955
Summary:
X-link: https://github.com/pytorch/executorch/pull/1769
Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` make the final state dict correctly align with the original module's state dict.
This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.
Test Plan: added a unit test
Differential Revision: D53253905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118722
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
Every day I move closer and closer to just using numbers
* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
This PR adds the FSDP all-gather (the copy-in/all-gather collective and the copy-out) and the unsharded parameter concept to `FSDPParam`. This is to prepare for being able to run the forward pass.
- We implement all-gather as two functions: `foreach_all_gather` (copy-in/all-gather collective) and `foreach_all_gather_copy_out` (copy-out).
- In the future, there will be two paths: `async_op=True` in the default stream for explicit prefetching and `async_op=False` in separate streams for implicit prefetching.
- In the future, we will use `torch.split_with_sizes_copy` in the copy-out when it has the CUDA fast path.
- We have the functions operate on `List[FSDPParam]` instead of passing the `torch.Tensor` and metadata mainly so that the `all_gather_input` can be computed under the `all_gather_copy_in_stream`. Since the two functions are specific to FSDP, I did not see motivation for avoiding this at the cost of entering/exiting the `all_gather_copy_in_stream` context twice (which incurs some CPU overhead).
- The `init_all_gather_output()` and `init_unsharded_parameter()` functions may seem unintuitive. The reason we initialize them once and write to them in-place thereafter is for autograd. See the note `[Note: FSDP and autograd]` in the code.
- We expand our 'FSDP tensors' definition to include the all-gather input and all-gather output in addition to the sharded and unsharded parameters. This distinction might seem unnecessary or pedantic, but it enables a language for describing pre- and post-all-gather transformations.
- We use the `_unsafe_preserve_version_counters` context when copying out because otherwise autograd will complain of a version mismatch in backward due to writing to the leaf tensors. (An alternative would be to use `.data`, but we are avoiding that 😄 .)
---
<details>
<summary> Copy-in/All-Gather/Copy-Out Example </summary>
Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB
Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB
Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB
Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB
Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
---
For context, we use the copy-in/all-gather/copy-out strategy instead of NCCL group coalescing for two reasons:
1. One large NCCL all-gather is still noticeably faster than several NCCL all-gathers using group coalescing of the same total bytes (even after NCCL 2.18.3). We prefer to trade off extra device-to-device copies (using GPU high-bandwidth memory) to save communication time, which does not improve as fast from hardware generation to generation.
2. Copying out of the all-gather buffer tensor simplifies multi-stream memory handling because there is a constant number of such all-gather tensors alive at once. (The copy-out is done in the default/compute stream.) If we directly used the all-gather tensor memory for computation, then the number of such alive tensors is linear in the module depth and hence dependent on the particular model.
---
Disclaimer: This PR has some extraneous code, but I did not want to simplify too much since that code will be added back soon anyway (e.g. for overlapping, mixed precision, and ZeRO++). Hopefully it does not hinder code review too much.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117950
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.
This change ensures that all registers_ accesses are in bounds or an exception will be thrown.
Test Plan: contbuild + OSS signals
Differential Revision: D49748737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
```
...
from /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/test/vec_test_all_types.cpp:1:
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h: In member function 'bool at::vec::DEFAULT::Vectorized::has_inf_nan() const':
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:244:36: error: no matching function for call to 'at::vec::DEFAULT::Vectorized::_isinf(float&) const'
  244 |   if(_isnan(_vec0[i]) || _isinf(_vec0[i])) {
      |                          ~~~~~~^~~~~~~~~~
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:237:21: note: candidate: 'at::vec::DEFAULT::Vectorized at::vec::DEFAULT::Vectorized::_isinf() const'
...
```
Started breaking from
29516bd2a0.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118516
Approved by: https://github.com/ezyang
Ref: #86340
Fixes #118148
This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test, unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
Lbfgs is special, as it will call the objective function multiple times internally. So I felt making a one-off test for lbfgs might be justifiable.
We will test if each step taken internally by the optimizer is the same for R^2 and complex parameters.
Let me know if the approach is ok, thanks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
Changelog:
- Don't count running PYTORCH_TEST_WITH_DYNAMO=1 on dynamo/ tests in the pass
rate. This was a bug (we were counting all of these as failing, but in
reality, most of these pass). The net effect is that the passrate is (artificially)
6% higher.
- Have the histogram script filter out skips based on the passrate metric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118752
Approved by: https://github.com/jamesjwu
Summary: Exposes `dynamic_shapes` api at multiple levels so it's easier to replace the old API `dynamic_dim()` with the new API `Dim()`.
Test Plan: CI
Differential Revision: D53246409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118695
Approved by: https://github.com/ydwu4
Cutlass 3.3 offers the following improvements:
- Adds support for mixed precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to the Python interface
- Enhancements to sub-byte type handling in CuTe
- Several other bug fixes and performance improvements
- Minor doc update
Test Plan:
CI ( ciflow/trunk, ciflow/inductor )
pytest test/inductor/test_max_autotune.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118629
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/khabinov
Addresses issue https://github.com/pytorch/pytorch/issues/117383
The implementation exposes `--filter-ranks` which filters by rank which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr)
## Behavior
### with --tee
Currently --tee is implemented as --redirect to file, and streams file to console using `tail`. When --tee is specified, file logs will be unaffected and we will only filter the output to console.
### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.
### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.
The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.
## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
__compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
------------- ------ ----------------------- --------- --------
placeholder l_x_ L_x_ () {}
call_function mul <built-in function mul> (l_x_, 5) {}
output output output ((mul,),) {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
[rank0]:------------- ------ ----------------------- --------- --------
[rank0]:placeholder l_x_ L_x_ () {}
[rank0]:call_function mul <built-in function mul> (l_x_, 5) {}
[rank0]:output output output ((mul,),) {}
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
Summary:
This diff implements a mechanism for safely updating the torch.export serialization schema, aka schema.py, which is the API surface with the strongest compatibility guarantee.
The diff consists of 3 changes:
- Added a script to "build" or "materialize" schema.py into a platform-neutral format (yaml), which serves as the committed form of the serialization schema.
- Added a unit test that compares schema.py against schema.yaml, forcing developers to run the updater script whenever the two files diverge.
- Added a checker inside the updater script, so that every compatible change results in a minor version bump and every incompatible change results in a major version bump (a sketch of this rule follows below).
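A minimal sketch of the bump rule (not the actual checker in the updater script; the function name and inputs are illustrative):
```python
def next_version(current, has_incompatible_change, has_compatible_change):
    major, minor = current
    if has_incompatible_change:   # e.g. removing or retyping a field
        return major + 1, 1
    if has_compatible_change:     # e.g. adding an optional field
        return major, minor + 1
    return major, minor

assert next_version((5, 2), False, True) == (5, 3)   # compatible -> minor bump
assert next_version((5, 2), True, False) == (6, 1)   # incompatible -> major bump
```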
torch.export's serialization BC/FC policy is (tentatively) documented here: https://docs.google.com/document/d/1EN7JrHbOPDhbpLDtiYG4_BPUs7PttpXlbZ27FuwKhxg/edit#heading=h.pup7ir8rqjhx , we will update the
As noted in the code doc, people should be able to run the following command to update schema properly from now on:
```
python scripts/export/update_schema.py --prefix <path_to_torch_development_diretory>
or
buck run caffe2:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
```
Test Plan:
buck test mode/opt caffe2/test:test_export -- -r test_schema
buck run caffe2:update_export_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
Differential Revision: D52971020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118424
Approved by: https://github.com/angelayi
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.
### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
This PR adds the initial `_lazy_init`. Lazy initialization marks the point when the FSDP structure is finalized and is typically the beginning of the first forward. This would be after any meta-device initialization.
- Lazy initialization is distinct from construction time because when processing `fully_shard(module)`, there is no way to know whether a parent of `module` will have `fully_shard` applied as well. This is a consequence of `fully_shard` having to be applied bottom-up.
- At lazy initialization, we now have the concept of a _root_. The root FSDP module is the one whose `forward` runs first and ends last (and hence similarly for its backward). Having a single root simplifies handling logic that should only run "once per forward/backward/iteration". We may consider relaxing this in the future, but it will add more complexity to the design.
- Once we have a root, we can define _fully-qualified names_ (FQNs) for both parameters and modules. To aid debugging, we store `_param_fqn` and `_module_fqn` on `FSDPParam` and `FSDPParamGroup`, respectively. Note that we can have a unique `_module_fqn` for `FSDPParamGroup` since we currently assume a 1:1 relationship.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117881
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867, #117877
This setting is problematic in fbcode, where the expected behavior is to match `arc lint`, which has a behavior much like running `lintrunner` without a `--merge-base-with` argument.
Let's try removing this. I also updated the CI message to encourage people to run with `-m origin/main`, which should hopefully cut down on confusion in the absence of defaulting to that behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118677
Approved by: https://github.com/PaliC
This PR adds logic to shard the managed parameters on dim-0. This is like `distribute_tensor()` with two differences:
1. `distribute_tensor()` today cannot accept a `DTensor` and reshard it to the parent mesh (https://github.com/pytorch/pytorch/issues/116101).
2. `DTensor` does not pad its local shard on any `Shard` dimensions (https://github.com/pytorch/pytorch/issues/113045).
As such, the `FSDPParam._init_sharded_param()` derives the global `DTensor` metadata itself and pads the local tensor on dim-0. The padding helps make the all-gather copy-in more efficient since the all-gather buffer will require padding.
---
Some details:
- We free the original parameter manually after constructing the sharded parameter. This lowers the peak memory during construction time slightly (since not _all_ parameters in the group must be sharded before the original parameters are freed) and is not strictly necessary.
- We bypass `nn.Module.__setattr__` because the checks are slow and unnecessary. The drawback is that we would ignore a user-defined override of `__setattr__`; however, since we have never encountered this in practice, I am okay with this. Notably, user calls to `setattr` would still use the override; FSDP only uses `setattr` as a mechanism for switching between sharded and unsharded parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117877
Approved by: https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867
Summary:
This is part of the pass migration effort. The final target is removing the acc tracer in AOTI.
In this diff, I did a few things:
1. Copied and modified the `fx_passes/split_cat.py` passes to work on predispatch IR.
2. Verified correctness by copying `test_split_cat_fx_passes.py` into a new file, `test_split_cat_fx_passes_aten_fb.py`, which runs under AOTI and checks the counters.
3. Created a util function that executes the pass and compares the before/after graphs, giving users more information such as the pass effect and time spent. It produces logs like:
```
[2024-01-25 20:26:48,997] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 0, save before/after graph to /tmp/tmpvlpwrklp, graph before/after are the same = False, time elapsed = 0:00:00.001585
[2024-01-25 20:26:49,000] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 1, save before/after graph to /tmp/tmpz_onjfeu, graph before/after are the same = False, time elapsed = 0:00:00.001873
[2024-01-25 20:26:49,002] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 2, save before/after graph to /tmp/tmpgkck8yko, graph before/after are the same = True, time elapsed = 0:00:00.000269
[2024-01-25 20:26:49,007] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 3, save before/after graph to /tmp/tmpquenq06y, graph before/after are the same = False, time elapsed = 0:00:00.003621
[2024-01-25 20:26:49,009] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 4, save before/after graph to /tmp/tmpi8fia0dv, graph before/after are the same = True, time elapsed = 0:00:00.000190
```
Differential Revision: D53171027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118590
Approved by: https://github.com/kflu, https://github.com/khabinov, https://github.com/chenyang78
This PR (as a followup to #115530) resolves previous issues of not passing `assertEqual()` tests (with small error) when comparing outputs from the single-gpu model and the distributed model, under certain input/model sizes or when certain operations (e.g. weight-tying) are enabled. This is done by simply enabling higher precision computation using `dtype=torch.float64`.
What is not tested: whether distributed model training convergence is affected when using only `torch.float32` precision.
Test plan:
TP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
TP+SP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116436
Approved by: https://github.com/wanchaol
This PR adds the initial `FSDPParamGroup` and `FSDPParam` classes, and it focuses on the `ParamModuleInfo` data structure.
- `ParamModuleInfo` has the info needed to `setattr` a managed parameter, where it must account for shared parameters and shared modules.
```
# Shared parameter
lin1.weight = lin2.weight
# Shared module
mlp.lin1 = mlp.lin2
```
- In order for FSDP to find shared modules' parameters, we must use `remove_duplicate=False`. See https://github.com/pytorch/pytorch/pull/99448/ for the original context. Finding shared modules' parameters is not necessary for the `setattr` logic, but in case we need it in the future (like for existing FSDP's state dict), we include that info for now.
With this PR, we see the general system architecture:
- 1 `module` : 1 `fully_shard`
- 1 `fully_shard` : 1 `FSDPParamGroup`
- 1 `FSDPParamGroup` : k `FSDPParam`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117867
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814
Squashed to include https://github.com/pytorch/pytorch/pull/117861, https://github.com/pytorch/pytorch/pull/117852
---
This PR adds `_get_managed_modules()` to determine which modules a `fully_shard(module)` call manages. The rule is defined as:
> `fully_shard(module)` manages all modules in `module.modules()` except those already managed by a nested `fully_shard()` or a nested non-composable API (e.g. `replicate()` or TorchRec).
Practically, this can be implemented as a graph search from `module` that does not proceed into any module with `fully_shard` or a non-composable API applied. Because the non-composable APIs follow the same rule, this rule is correct inductively.
---
This PR adds `_get_managed_states(managed_modules)` to return the managed parameters and buffers given the managed modules.
- Without an extra mechanism to ignore specific parameters or buffers, the rule currently is simply to get the directly managed state (i.e. parameters/buffers) from each managed module while de-duplicating shared ones.
- However, we prefer this translation from managed modules to managed states to accommodate ignoring specific states in the future (which has appeared in various open-source use cases).
---
This PR adds the `mesh` argument to `fully_shard` and some helper data structures specific to FSDP/HSDP that pre-compute useful info like rank/world size for each mesh dim.
- The `mesh` defines the FSDP/HSDP algorithm. 1D mesh means FSDP, and 2D mesh means HSDP, where we assume sharding on the last dimension.
- We can revisit the HSDP sharding-dim assumption if needed in the future.
- The default (if `mesh is None`) is that `fully_shard` calls `init_device_mesh` following the global process group.
- The helper data structures are the various `*MeshInfo`s. I included up to the `HSDPMeshInfo` even though it will not be immediately used to show the spirit of it. We want to tag both the shard and replicate dims.
- The `mesh_info` variable in `fully_shard` is not used for now. It will be passed downstream in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117814
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #118525
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module.
- We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
- We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
**Reland details:** I removed `test/distributed/_composable/fsdp/_test_fully_shard_common.py` and moved its contents to the existing `torch/testing/_internal/common_fsdp.py`, which is already a target for internal tests.
Differential Revision: [D53187509](https://our.internmc.facebook.com/intern/diff/D53187509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118525
Approved by: https://github.com/wanchaol
Summary:
Previously, by default we did not generate quantized weights; that is, we would have fp32 weights and
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model
After this PR, we'll produce a graph with int8 weight by default after convert_pt2e:
`int8 weight -> dq -> linear -> ...`
We'll remove the fold_quantize flag in the next PR
Test Plan: CI
Differential Revision: D51730862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118605
Approved by: https://github.com/andrewor14
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re
with open("error_file.txt", "r") as f:
    errors = f.readlines()
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
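For illustration, the pattern the rule prefers (the keys here are made up):
```python
keys = ["conv", "bn", "relu"]

comprehension = {k: None for k in keys}   # what the rule flags
fromkeys = dict.fromkeys(keys)            # clearer and faster; values default to None
flags = dict.fromkeys(keys, False)        # same static value for every key

assert comprehension == fromkeys
# Caveat: a mutable default (e.g. dict.fromkeys(keys, [])) is shared by all keys.
```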
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
Summary:
While debugging some timed-out jobs, I found it difficult to identify which rank is at fault, even though we have logs of many ranks reporting a timeout on a specific collective seq.
If we also report lastEnqueuedSeq and lastCompletedSeq, it becomes much easier to identify:
1. whether a rank has not even joined the collective call (not enqueued), or
2. whether it has joined the collective call but not completed it.
For the 1st case, it is most likely a problem in user code; for the 2nd case, it could be a lower-layer issue. A sketch of this decision logic follows.
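The sketch below is illustrative only and not actual c10d code; the function and messages are made up.
```python
def diagnose(timed_out_seq, last_enqueued_seq, last_completed_seq):
    if last_enqueued_seq < timed_out_seq:
        # The rank never even enqueued this collective.
        return "rank never joined the collective (most likely a user-code issue)"
    if last_completed_seq < timed_out_seq:
        # The rank joined but the collective did not finish.
        return "rank joined but did not complete (possibly a lower-layer issue)"
    return "rank already completed this collective"

print(diagnose(42, 41, 41))  # -> never joined seq 42
print(diagnose(42, 42, 41))  # -> joined but did not complete
```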
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
Summary:
While learning Vulkan shaders, I realized one of the branches can be easily optimized.
The relevant branch is only taken when we unsqueeze along `dim == 1` for 3D tensors.
1. There's an unnecessary for-loop.
2. There's an unnecessary dependency on the output tensor's number of channels.
## CPU Tensor
```
3D->4D: (c, h, w) -> (c, 0, h, w)
```
## GPU Texture
```
3D->4D: (w, h, c/4)[c%4] -> (w, h, c)[0]
```
Note the GPU Texture's output is always at `[0]` and the output tensor's number of channels is always 1.
We are currently writing the same value `v[p]` to all elements of the texel `out_texel`, but we need only write it to `out_texel[0]`.
Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (ca3b566bc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/2c7f1365-e004-41a0-9201-473929a2738a
Network: Up: 174B Down: 0B (reSessionID-c54d25da-f44b-49f7-8bfd-1db4eee50f6d)
Jobs completed: 6. Time elapsed: 14.4s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.unsqueeze_0dto1d_dim0
[ OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (60 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim0
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim1
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (132 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim0
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (20 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim1
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (66 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim2
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (3 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim0
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (19 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim1
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim2
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim3
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (307 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (307 ms total)
[ PASSED ] 10 tests.
[
```
Differential Revision: D53189637
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118575
Approved by: https://github.com/yipjustin
I don't think we should be unlifting HOO submodules.
What is the contract of unlifting? It is to restore the original calling convention of the module, undoing the transformation in which we lift parameters, buffers, and constants to graph inputs.
Unlifting does *not* make any guarantees about what's going on inside the module. It's still a flat module. So why should we unlift the cond/map submodules? It doesn't have anything to do with the contract stated above; it's internal detail that doesn't affect how the module will be called.
Further, this code as written modifies the state dict; adding a new buffer that is actually duplicate of a previous buffer. Modifying the state dict from the original eager module is never correct.
Differential Revision: [D53160713](https://our.internmc.facebook.com/intern/diff/D53160713/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118610
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609
This PR rewrites two paths to use the newly-added keypaths API in pytree:
First: we were hand-rolling a tree_map during fakification because we wanted to track sources. This PR uses keypaths instead, which can do the same thing without needing custom code.
Second: our constraint error formatting was referencing placeholder names in error messages. These placeholder names are not otherwise user-visible, so they are super confusing to users (e.g. "which input does arg1_3 correspond to?"). This diff uses the `keystr` API to format the error message.
This necessitated some small refactors: generating the keystr is expensive, so doing it in an f-string was very bad.
It can also be further improved: we can inspect the signature so that instead of `*args[0]` we give people the actual argument name, which would be the ideal UX. But that is left for later.
Differential Revision: [D53139358](https://our.internmc.facebook.com/intern/diff/D53139358/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118609
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608
tree_flatten_spec is bad; it isn't synced up with `register_pytree_node` so it will not handle arbitrary custom pytrees. It's also not really maintained.
We only use it for two purposes:
- To retain kwarg ordering stability, so that if the user passes in kwargs in a different order things will still work.
- To do "structural" checks that ignore types.
In both cases, tree_flatten_spec is probably *not* the ideal way to implement the desired behavior.
## kwargs ordering
- tree_flatten_spec overwrites the behavior of ALL dictionaries, not just kwargs. This is not correct, dictionary ordering is meaningful in Python, and it's pretty trivial to write a program that relies on dict ordering.
- For kwargs, we do sort of expect that the order in which arguments are passed shouldn't matter. BUT there is one exception: `**kwargs`. In fact, [PEP 468](https://peps.python.org/pep-0468/) was introduced specifically to clarify that ordering does matter when the function being called uses `**kwargs`.
In this diff I introduce a utility function that *only* reorders kwargs. This gets us most of the way to correct behavior: dicts are no longer reordered, but kwargs can be passed in any order (a sketch of the idea follows below).
A "fully correct" solution would need to fix the corner case from PEP 468. We could detect whether the top-level fn being traced uses `**kwargs` (via `inspect`), then serialize a flag for it. In ExportedProgram, we would check that flag and only re-order if `**kwargs` was unused; otherwise error if the key order doesn't match. This is a super corner case though, so I'll file it as a followup task.
## structural equivalence checking
This is another use case where `tree_flatten_spec` is too broad. Generally we want to treat two specific types as equivalent, not override comparison behavior across the board. So I introduce an `is_equivalent` util for this purpose.
Differential Revision: [D53168420](https://our.internmc.facebook.com/intern/diff/D53168420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118608
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607
## Context
This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).
This changeset enables [resource aliasing](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/resource_aliasing.html), a technique that allows two resources (i.e. `VkImage`s or `VkBuffer`s) to bind to the same memory allocation. This is the core feature that allows memory planning to be implemented in PyTorch Vulkan.
## Notes for Reviewers
At a high level, this changeset introduces the `MemoryAllocation` struct which represents a raw `VmaAllocation`. `VulkanImage` and `VulkanBuffer` have been updated to store a `MemoryAllocation` member instead of the raw handle of a `VmaAllocation`.
`vTensor`, `VulkanImage`, and `VulkanBuffer` constructors now have an `allocate_memory` argument which controls whether memory is allocated on construction. If `false`, then memory must be allocated separately and bound later using `bind_allocation()` before the resource can be used.
Internal:
## Notes for Internal Reviewers
Please refer to [this design doc](https://docs.google.com/document/d/1EspYYdkmzOrfd76mPH2_2BgTDt-sOeFnwTkV3ZsFZr0/edit?usp=sharing) to understand how memory planning will work end-to-end in the Vulkan Delegate.
Differential Revision: [D53136249](https://our.internmc.facebook.com/intern/diff/D53136249/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118436
Approved by: https://github.com/yipjustin
Follow-up to https://github.com/pytorch/pytorch/pull/118359: clarify whether ``src`` and ``dst`` are based on the global pg or a sub-pg.
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument (a sketch illustrating this follows at the end of this description)
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``
3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the device argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``
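A minimal sketch illustrating the docstring clarification (intended to run under e.g. `torchrun --nproc_per_node=4`; the ranks and tensor are illustrative):
```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
rank = dist.get_rank()

subgroup = dist.new_group(ranks=[2, 3])  # global ranks 2 and 3
if rank in (2, 3):
    t = torch.ones(1) * rank
    # dst=2 means *global* rank 2, not "the first rank within the subgroup".
    dist.reduce(t, dst=2, group=subgroup)
    if rank == 2:
        print("reduced value on global rank 2:", t)
```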
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
These operators are not used and have been deprecated since #72690
(Feb 2022).
BC-breaking message:
`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
### Summary
Native functional collective ops require the backend to be implemented in C++. This ports `FakeProcessGroup` to C++ so that it also works with native functional collective ops.
The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
Resolves #117749
Summary:
Updated the PR with the following intentions:
1. Identify eager-mode init (as opposed to lazy init), in which case we create NCCL comms without guaranteeing that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10d guarantees/waits for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d does not change much from blocking calls (c10d waits for the NCCL call to return ncclSuccess or time out, whichever happens first).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
Summary:
### Context
It's possible for the args of a user-defined Triton kernel to be codegen-ed twice, but this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.
When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```
### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.
Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608
# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```
Reviewed By: jingsh, tissue3
Differential Revision: D53190934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re
with open("error_file.txt", "r") as f:
    errors = f.readlines()
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Moves test discovery into a file that doesn't import torch, so test listing can be done without having torch installed.
Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.
The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
Summary:
We used to skip the verifier when the signature object was not the "correct" one (usually from some deprecated frontend). This was useful when we wanted to pay a small cost to enable the verifier path to be called everywhere for torch.export.
Now I believe no tests rely on this behavior, so we should remove this odd branch.
Test Plan: CI
Differential Revision: D53024506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118139
Approved by: https://github.com/suo
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.
NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
When a constant-folded SymInt was used to construct a tensor that was then constant folded, we previously tried to use the sympy symbol, which would error (the construction should take a SymInt, not a symbol).
Fix by recording the observed size during constant folding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118411
Approved by: https://github.com/ezyang
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).
Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).
With this, they run at ~1.5 GB/s which is still bad but better than before!
It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.
All of the numbers above are taken on a Samsung 990 Pro SSD, on Linux kernel 6.5, with the FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open`, which uses `UntypedStorage.from_file` to load the file into memory, which in turn uses MapAllocator.cpp.
I felt safe doing this change unconditionally, but please let me know if you'd like to see a separate allocator flag for this, propagated through to `UntypedStorage`. Note that the fadvise API is not available on macOS.
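For illustration, the OS-level hint being discussed looks roughly like this from Python (the file name is hypothetical, and this is not the MapAllocator.cpp change itself; `os.posix_fadvise` is Unix-only and, as noted above, fadvise is unavailable on macOS):
```python
import mmap
import os

fd = os.open("model.safetensors", os.O_RDONLY)
# Tell the kernel we will read sequentially so it can use a larger read-ahead window.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
mapped = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
```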
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
Before this PR, we had a graph break for code like this:
```python
def test_get_device_properties_tensor_device(a):
x = a.to("cuda")
prop = torch.cuda.get_device_properties(x.device)
if prop.major == 8:
return x + prop.multi_processor_count
return x + prop.max_threads_per_multi_processor
```
This PR constant-folds torch.cuda.get_device_properties, and we get the following dynamo graph:
```python
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] def forward(self, L_a_ : torch.Tensor):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] l_a_ = L_a_
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:544 in test_get_device_properties_tensor_device, code: x = a.to("cuda")
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] x = l_a_.to('cuda'); l_a_ = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:547 in test_get_device_properties_tensor_device, code: return x + prop.multi_processor_count
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] add = x + 108; x = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG] return (add,)
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
```
The signature of get_device_properties is:
```python
def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
```
I think it's safe to constant fold get_device_properties():
1. torch.cuda.get_device_properties(tensor.device). In this case, tensor.device.index is guarded in _check_tensor
2. torch.cuda.get_device_properties(device_int_id). We don't expect the GPU properties for a particular index to change during a torch.compile run, and it makes sense to specialize the properties for a concrete device_int_id.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118422
Approved by: https://github.com/yanboliang, https://github.com/jansel
Before this PR, we got an error for the following code:
```python
def k(x):
with torch.inference_mode():
x = x + 1
return x
torch.compile(k, backend="eager", fullgraph=True)(x)
```
error message:
```
Traceback (most recent call last):
....
return InferenceModeVariable.create(tx, args[0].as_python_constant())
torch._dynamo.exc.InternalTorchDynamoError: list index out of range
```
This PR supports the case where torch.inference_mode is not given any argument (i.e., it defaults to True).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118427
Approved by: https://github.com/yanboliang, https://github.com/jansel
Summary:
The current implementation of `str` passes wide types (`wchar_t`, `wchar_t*`, `std::wstring`) directly to `std::ostringstream`. This has the following behavior:
- C++17, `wchar_t` & `wchar_t *`: print the integer representation of the character or the pointer. This is unexpected and almost certainly a (runtime) bug.
- C++17, `std::wstring`: compile-time error.
- C++20, all of the above: compile-time error.
To fix the bug and to enable C++20 migration, this diff performs narrowing on these wide types (assuming UTF-16 encoding) before passing them to `std::ostringstream`. This fixes both the C++20 compile time errors and the C++17 runtime bugs.
This bug surfaced in enabling C++20 windows builds, because windows specific caffe2 code uses `TORCH_CHECK` with wide strings, which references `str` for generating error messages.
Test Plan: CI & https://godbolt.org/z/ecTGd8Ma9
Differential Revision: D52792393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117531
Approved by: https://github.com/malfet
Instead rely on `GitHubPR.default_branch()` which is the name of the repo's default branch.
Do not pass the branch name when `merge_changes` is called, as it is set to the default branch inside the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118530
Approved by: https://github.com/clee2000
Dynamo creates Tensors when tracing through numpy ufuncs like np.sin, np.minimum, etc.; under `torch.compile`, np functions generally return Tensors at runtime. However, when normalizing `out` arguments we currently require the input to be an ndarray. This causes assertion errors when running torch.compile on any numpy function with an out argument:
```
def test_numpy_ufunc_out(self):
@torch.compile(backend="eager")
def foo():
x = np.arange(5)
out = np.empty((x.shape[0], x.shape[0]))
res_out = np.sin(x, out=out)
assert res_out is out
foo()
```
Failure with stack trace: https://gist.github.com/jamesjwu/68e217638d735678b3de968584dba23f
Instead, we can wrap tensors in an ndarray in normalize_outarray to handle the case correctly. Fixing this resolves ~220 tests under dynamo_test_failures, but also exposes a followup bug.
In the presence of a graph break, ndarrays don't preserve their id, which can affect assertions and `is` checks between numpy arrays:
```
def test_x_and_out_broadcast(self, ufunc):
x = self.get_x(ufunc)
out = np.empty((x.shape[0], x.shape[0]))
x_b = np.broadcast_to(x, out.shape)
# ufunc is just np.sin here
res_out = ufunc(x, out=out)
res_bcast = ufunc(x_b)
# passes
assert res_out is out
graph_break()
# fails
assert res_out is out
```
Regular tensors preserve their id because Dynamo caches their example tensor values across a graph break. However, with ndarrays, we only store their converted tensor values, and construct new ndarrays around those values:
eebe7e1d37/torch/_dynamo/variables/builder.py (L1083)
Added a test with an expected failure to showcase this; we can then fix that issue separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118248
Approved by: https://github.com/lezcano
Summary:
Adds a JK killswitch check and configures the env for enabling the PyTorch NCCL flight recorder. Note: this only enables recording events in memory, not dumping them.
Test Plan: CI test
Reviewed By: zdevito
Differential Revision: D52920092
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
Related to #118494, it is not clear to users that the default behavior is to include **all** feasible archs (if the 'TORCH_CUDA_ARCH_LIST' is not set).
In these scenarios, a user may experience a long build time. This adds a print statement to surface that behavior. (A `verbose` arg is not available here, and it does not seem necessary to add a `verbose` arg to this function and all its parent functions.)
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118503
Approved by: https://github.com/ezyang
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.
The suppressions were added automatically with the following script generated by GPT-4:
```
import re
# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()
# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
The `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).
BC-breaking message:
All functions and classes under `torch.jit.quantized` will now raise an error if
called/instantiated. This API has long been deprecated in favor of
`torch.ao.nn.quantized`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118406
Approved by: https://github.com/jerryzh168
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.
Perhaps erroneously, when I teed up this PR I elected to delete the `follow_imports = skip` designations in mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257 ) since this is more like a rewrite.
In this PR, instead of changing GraphModule directly which can easily causes BC issue, I create a LazyGraphModule class as Zachary & Jason suggested in comments from the previous PR.
The difference between LazyGraphModule and GraphModule is mainly about how re-compile for the graph module happens. In GraphModule the recompilation happens 'eagerly': constructing a GraphModule will cause the recompilation. While in LazyGraphModule, we just mark the module as needing recompilation. The real recompilation only happens when absolutely required (e.g. call forward method, access the code property etc.). In a lot of cases in torch.compile, the real recompilation eventually is not triggered at all. This can save a few seconds of compilation time.
By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
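A minimal sketch of the "mark dirty, recompile on first use" pattern described above (not the actual LazyGraphModule implementation; class and attribute names are illustrative):
```python
class LazyRecompiler:
    def __init__(self, nodes):
        self.nodes = nodes
        self._needs_recompile = True
        self._code = None

    def mutate_graph(self, node):
        # Mutations only flag the module; no codegen happens here.
        self.nodes.append(node)
        self._needs_recompile = True

    def _real_recompile(self):
        # The expensive codegen is deferred until the code is actually needed.
        self._code = "\n".join(f"# node: {n}" for n in self.nodes)
        self._needs_recompile = False

    @property
    def code(self):
        if self._needs_recompile:
            self._real_recompile()
        return self._code

gm = LazyRecompiler(["placeholder x", "call_function add"])
gm.mutate_graph("output")  # cheap: just marks the module dirty
print(gm.code)             # first access triggers the real recompile
```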
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
Addresses #118337 somewhat; we probably need to update docs. Let's first
confirm what behavior we want.
Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
of whether a subgroup is passed in. This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
side, imo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
Resolves https://github.com/pytorch/pytorch/issues/117749
Summary:
This is the first step to enable NCCL nonblocking mode.
In NCCL nonblocking mode, ncclInProgress is an expected return value
when checking communicators. Without this relaxation, watchdog thread
would throw NCCL errors during work checking while it is expected.
Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
This is a lot of files changed! Don't panic! Here's how it works:
* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.
In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.
The codemod was done with this script authored by GPT-4:
```
import glob
exclude_patterns = [
    ...
]
for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
Mention co-authors in PR body
Modify the `CommitAuthors` query to include the first two commit `authors`, which makes sure that authors from suggested commits are recognized.
Test plan: CI + check `get_authors()` on a few PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118347
Approved by: https://github.com/kit1980
Summary:
AVX extension flags are x86-specific, and clang-18 has started to error on them when building targets that are not x86. I couldn't find the upstream change that made these flags an error, but it's fairly clear that these flags do not apply to all architectures.
Most of the flags are already defined in `platform_compiler_flags`. The changes done:
* Gate the flags under `compiler_flags` with `selects`
* If flags weren't defined in `platform_compiler_flags`, define them there as well
* Remove the `^` and `$` in the platform regex. Not all flavors start with the platform (e.g. `android-x86_64`).
* Some minor formatting changes were also included here.
Test Plan:
Atop D52741786,
```
buck2 build --flagfile 'arvr/mode/android/apk/linux/opt' '//arvr/projects/mixedreality/android/ocean_passthrough_service:ocean_passthrough_mrservice_dev'
```
Differential Revision: D52856224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117923
Approved by: https://github.com/mcr229
Enables the deduplication of saved entries by load balancing duplicates across ranks.
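A minimal sketch of the load-balancing idea (not the actual DCP implementation; names are illustrative): every entry that is replicated on several ranks is written by exactly one of them, chosen round-robin so the write work is spread out instead of all landing on rank 0.
```python
def assign_writers(duplicates: dict) -> dict:
    # duplicates: {fqn: [global ranks that hold this entry]}
    assignment = {}
    for i, (fqn, ranks) in enumerate(sorted(duplicates.items())):
        assignment[fqn] = sorted(ranks)[i % len(ranks)]
    return assignment

print(assign_writers({"w1": [0, 1], "w2": [0, 1], "w3": [0, 1]}))
# -> {'w1': 0, 'w2': 1, 'w3': 0}: writes alternate between the two ranks
```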
Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation has been measured at ~19 seconds.
```
def run(local_rank, world_size, param_size, num_params, work_dir):
    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)
    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)
    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")
    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1,
    )
    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```
Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.
This PR:
- adds torch::Library::clear(), which deterministically releases all of
the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
a library and cleans it up at the end of the scope using the previous item.
All tests (unless they already handle library lifetimes) should use
this new API
- Rewrites some flaky tests to use `_scoped_library`.
In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC
I'm hoping this will deflake those tests; we'll see.
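A minimal usage sketch, assuming the context manager mirrors `torch.library.Library(ns, kind)` (the namespace "mylib" and the op are illustrative):
```python
import torch
from torch.library import _scoped_library

with _scoped_library("mylib", "FRAGMENT") as lib:
    lib.define("add_one(Tensor x) -> Tensor")
    lib.impl("add_one", lambda x: x + 1, "CPU")
    assert torch.ops.mylib.add_one(torch.zeros(2)).sum() == 2
# All registrations made through `lib` are released deterministically here,
# so one test cannot leak operator registrations into another.
```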
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
When doing print(f.read().decode(), etc.) an extra newline is printed, so manually splitlines and strip to see if that helps.
My guess is Windows line-ending differences.
Also always save log file regardless of success or failure
See 476b81a9bf for what it looks like now
Swapped to opening in text mode instead of binary, seems to be ok now.
42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
Summary: While turning on .module() for all the export tests, I uncovered some bugs with .module() and while fixing them I ended up rewriting some of the code... Some of the bugs were:
* bad kwargs support on the unlifted module
* no support for user input mutations
* (at the commit hash i was working off of) no support for custom objects
* there were no tests on unlifting weights from cond/map submodules
Test Plan: CI
Differential Revision: D53075380
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118272
Approved by: https://github.com/suo
Summary: When there were optionals with specified default values, the code improperly handled the number of parameters, causing `IndexError: tuple index out of range`.
Test Plan: New tests.
Reviewed By: zou3519
Differential Revision: D53095812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118331
Approved by: https://github.com/zou3519
This PR relands #108238, which was closed as stale due to CLA issues and because the CI check marked the PR as not mergeable.
Repro 1:
```python
import torch
def f(x):
return x[x > 0]
jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:
```bash
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
traced = torch._C._create_function_from_trace(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```
Repro2:
```python
import torch
import torch.nn.functional as F
from torch import nn
import copy
class Net(nn.Module):
def __init__(self):
super().__init__()
def forward(self, inputs):
x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
x = F.relu(x)
return x
model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```
Error 2:
```bash
Traceback (most recent call last):
File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
return trace_module(
^^^^^^^^^^^^^
File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
module._c._create_method_from_trace(
File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
result = self.forward(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/test_deepcopy.py", line 12, in forward
x = F.relu(x)
^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
new_storage = self.clone()
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 126, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```
----
#48054 RuntimeError: NYI: Named tensors are not supported with the tracer
#49538 jit tracer doesn't work with unflatten layer
#31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
- This bug was closed but still exists; multiple comments on it still show the error. This is addressed here.
Likely fixes the following issues (but untested)
#63297 Named tensor in tracer
#2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx
Fix zero-dimensional tensors when used with jit.trace. They are currently assigned an empty set of names {}, which is not the same as "no name", so jit.trace bails with
"NYI: Named tensors are not supported with the tracer"
This happens when trying to save a non-trivial model as ONNX, but the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.
Test plan:
New unit test added
Broken scenarios tested locally
CI
Fixes#48054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.
The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
This PR rewrites the sharded embedding rule to use OpStrategy instead of the
rule, one step further toward getting rid of rules and consolidating the embedding
operator implementation, to prepare for the rowwise embedding
implementation, which will come in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
Let me tell you, this was a *journey.*
* When we repropagate through the FX interpreter in AOTAutograd, this will reallocate unbacked SymInts. We can eliminate all of these fresh allocations by appropriately asserting equalities on them and setting up replacements. See also https://github.com/pytorch/pytorch/issues/111950
* The `inner_fn` of Loops can contain references to unbacked SymInts. We must collect them to prevent DCE.
* Export naughtily accessed `_expr` when it should have accessed `expr` on SymNode. Fixed two sites of this.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117862
Approved by: https://github.com/bdhirsh
Use case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:
eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)
If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).
Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.
Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
The flag is not correctly set when PyTorch is compiled with GPU support resulting in failures in
`test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex`.
Use a similar approach to test_meta and skip the check for this function.
Workaround for #105068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117972
Approved by: https://github.com/lezcano
`_CollectiveKernel.create_inplace` expresses mutation with the newly introduced `MutationOutput` which requires the `layout` of the input. Currently, there's a bug where if the input is a view, `inp.layout` fails. This PR fixes the issue by unwrapping the input if it's a view.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118333
Approved by: https://github.com/wanchaol
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add `tensor.dim() == 0` tests.
3. Address `readability-container-size-empty` and `performance-unnecessary-copy-initialization` linter errors.
Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (1d0b920e0)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/ops/Unsqueeze.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/98bb3bfa-a1d1-440e-8724-b4990c9cc7ca
Network: Up: 1.4MiB Down: 377KiB (reSessionID-6eccf420-3951-4942-9350-998803589b8d)
Jobs completed: 17. Time elapsed: 42.6s.
Cache hits: 38%. Commands: 8 (cached: 3, remote: 0, local: 5)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN ] VulkanAPITest.stack_invalid_inputs
[ OK ] VulkanAPITest.stack_invalid_inputs (27 ms)
[ RUN ] VulkanAPITest.stack_0d
[ OK ] VulkanAPITest.stack_0d (28 ms)
[ RUN ] VulkanAPITest.stack_1d
[ OK ] VulkanAPITest.stack_1d (1 ms)
[ RUN ] VulkanAPITest.stack_2d
[ OK ] VulkanAPITest.stack_2d (148 ms)
[ RUN ] VulkanAPITest.stack_3d
[ OK ] VulkanAPITest.stack_3d (354 ms)
[----------] 5 tests from VulkanAPITest (561 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (561 ms total)
[ PASSED ] 5 tests.
```
Reviewed By: copyrightly, liuk22
Differential Revision: D53071188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118314
Approved by: https://github.com/liuk22
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add 0->1 `tensor.dim()` tests.
3. Remove `dim == 0` case from shader since that path is never executed. The `cpp` code sends the input to `submit_copy` instead.
Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (c66693c95)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/16cf8f59-e535-493b-b123-5952ef8f1453
Network: Up: 21KiB Down: 1.4MiB (reSessionID-1219eefd-e78b-4bfd-aef8-8e4b38da82f8)
Jobs completed: 8. Time elapsed: 37.8s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 1, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN ] VulkanAPITest.unsqueeze_0dto1d_dim0
[ OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (61 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim0
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN ] VulkanAPITest.unsqueeze_1dto2d_dim1
[ OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (110 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim0
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (16 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim1
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (58 ms)
[ RUN ] VulkanAPITest.unsqueeze_2dto3d_dim2
[ OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (2 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim0
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (16 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim1
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim2
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN ] VulkanAPITest.unsqueeze_3dto4d_dim3
[ OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (270 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (270 ms total)
[ PASSED ] 10 tests.
```
Also, to improve my confidence in unit tests, I modified [force_flush.py](https://www.internalfb.com/code/fbsource/[6e606c6f62dafd2121e78ffe14ae12f1b6d8d405]/fbcode/wearables/camera/ml/pytorch_vulkan_native/demo/force_flush.py) to run several combinations of `aten::unsqueeze` on OD.
Verified these work as expected.
```
torch.zeros([])
torch.randn([])
torch.rand([])
torch.ones([])
torch.tensor(0, dtype=torch.float)
```
Found that Vulkan in general does not support the following. That's ok though since it's technically a 1d tensor which is not part of my task.
```
torch.tensor([])
```
Differential Revision: D53071189
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118311
Approved by: https://github.com/liuk22
# Changes
* introduce `--check-mergeability` trymerge flag that attempts to merge PR locally, using the same merge logic as the mergebot, but requires just a read-only `GITHUB_TOKEN` and git repo.
* change mergeability workflow to utilize the new --check-mergeability logic
# Alternatives considered
1.
> Rewrite `https://github.com/pytorch/test-infra/actions/workflows/pr-dependencies-check.yml` to correctly support partially merged ghstacks.
That would be a slightly better approach, but ROI is lower, as it requires reimplementing trymerge logic and additional effort to consolidate the codebase (trymerge lives in pytorch repo).
`pr-dependencies-check.yml` still produces human-readable results for partially merged ghstack prs (even if it falsely reports them as non-mergeable).
2.
> Instead of introducing new trymerge flag, use existing flags, including `--dry-run`.
That didn't work, as no combination of existing flags skips the rule checks and ROCKSET lookups.
# Testing
1. Manual testing `trymerge.py --check-mergeability` on the regular and ghstack PRs:
```
export GITHUB_TOKEN=
export GIT_REPO_DIR=`pwd`
export GITHUB_REPOSITORY=pytorch/pytorch
export GIT_REMOTE_URL=https://github.com/pytorch/pytorch
# Test 1 (2 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 117862
Skipping 1 of 2 PR (#117859) as its already been merged
echo $?
0
# Test 2 (3 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 118125
Skipping 1 of 3 PR (#117859) as its already been merged
echo $?
0
# Test 3 (3 prs, intentional conflicts introduced into `main`):
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability 118125
Skipping 1 of 3 PR (#117859) as its already been merged
stdout:
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_inductor/lowering.py
CONFLICT (content): Merge conflict in torch/_inductor/lowering.py
error: could not apply 66ba5b8792f... Realize inputs to DynamicScalar before unwrapping
...
RuntimeError: Command `git -C /Users/ivanzaitsev/pytorch2 cherry-pick -x 66ba5b8792fa076c4e512d920651e5b6b7e466f4` returned non-zero exit code 1
```
2. Workflow run:
https://github.com/pytorch/pytorch/actions/runs/7660736172/job/20878651852?pr=118258
<img width="516" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/28fbf0d2-ac2a-4518-b41d-b32b41373747">
<img width="621" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/ddbf8566-a417-43ec-9d0e-f623f4a71313">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118258
Approved by: https://github.com/PaliC, https://github.com/huydhn
Otherwise it takes 1+h to build CUDA12.1 docker
- Limit UCC builds to just sm_52(M60) and sm_86(A10G), which I think has the biggest impact
- Replace hardcoded `-j6` build parallelism with more dynamic `-j$[$(nproc) - 2]`
- Remove redundant check about Ubuntu-14.04
- Added `DOCKER_BUILDKIT` to parallelize the builds
As a result, docker build time drops from 1+h to 35 min
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118167
Approved by: https://github.com/huydhn
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.
The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).
Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).
With this, they run at ~1.5 GB/s which is still bad but better than before!
It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.
All of the numbers above are taken on a Samsung 990 Pro SSD, on Linux kernel 6.5, with the FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open`, which uses UntypedStorage.from_file to load the file into memory, which in turn uses MapAllocator.cpp.
I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedStorage. Note that the fadvise API is not available on macOS.
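For reference, the same hint can be reproduced from Python via `os.posix_fadvise` (Linux only); a minimal sketch with a placeholder file name, separate from the actual C++ change in MapAllocator.cpp:
```python
import os

fd = os.open("model.safetensors", os.O_RDONLY)  # placeholder path
try:
    # Tell the kernel reads will be sequential; on Linux this roughly doubles
    # the read-ahead window (offset=0, length=0 means "the whole file").
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    while os.read(fd, 1 << 20):
        pass
finally:
    os.close(fd)
```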
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long).
Adds settings for no timeout (the step timeout still applies; this only removes the ~30 min timeout for a shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).
Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
Before the PR, we have a graph break for the following test:
```python
def test_cublas_allow_tf32(x):
if torch.backends.cuda.matmul.allow_tf32:
return x.sin() + 1
return x.cos() - 1
```
In this PR, we first add "torch.backends.cuda" to MOD_INLINELIST to trace through the python binding and get the actual call torch._C._get_cublas_allow_tf32, where it's already a TorchInGraphVariable. Because _get_cublas_allow_tf32 is accessing the same variable as at::globalContext().allowTF32CuBLAS(), which is guarded by dynamo as a global state [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp#L443), we could safely assume it returns a ConstantVariable during tracing.
After this pr, we get the following graph:
```python
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] def forward(self, L_x_ : torch.Tensor):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] l_x_ = L_x_
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:515 in test_cublas_allow_tf32, code: return x.cos() - 1
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] cos = l_x_.cos(); l_x_ = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] sub = cos - 1; cos = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] return (sub,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118236
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
Added support for constant outputs. We will just embed the constant directly into the output, like `return (x, 1)`.
Also adds support for None input/outputs. For None inputs we address it the same way we do to constants, which is that a placeholder with no users will be inserted into the graph, and the None will be embedded into whatever operator is using the None. For None outputs, we will also address the same way we do constants, which is that we embed it into the output, like `return (x, None)`.
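A minimal sketch of the kind of program this enables (module, shapes and inputs are made up for illustration):
```python
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x, flag=None):
        # `flag` is a None input: it becomes a placeholder with no users.
        # 1 and None below are embedded directly into the output,
        # i.e. the graph returns something like `return (add, 1, None)`.
        return x + 1, 1, None

ep = export(M(), (torch.randn(3), None))
print(ep.graph)
```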
Differential Revision: D52881070
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117894
Approved by: https://github.com/zhxchen17
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that two tensors with different ragged_idx cannot be added.
Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes
Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
Dry run open for labels so we can run trymerge locally with dryrun without actually affecting the PR
Make Dr.CI results easier to read (previously a massive json dump, now just the job names + ids, in a nicer format)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118240
Approved by: https://github.com/huydhn
We only want to traverse over each node in the graph exactly once, and we do that by inserting nodes into the "seen" set. The issue is that we forget to check the "seen" set when inserting the root nodes. Typically that is not a problem, because the root nodes come from different outputs and thus usually correspond to different nodes. With split_with_sizes, though, all of the outputs correspond to the same node, and this leads to the node being iterated over 3 times, and 3 sets of hooks being attached to the same node.
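For illustration, a generic standalone sketch of the traversal pattern being fixed (not the actual autograd code; `next_nodes` is a hypothetical stand-in for however the graph exposes successors):
```python
from collections import deque

def iter_graph(roots):
    seen, queue = set(), deque()
    for root in roots:
        # The missing check: dedupe the roots too. Without it, a node that backs
        # several outputs (e.g. all outputs of split_with_sizes) is enqueued --
        # and gets hooks attached -- once per output.
        if root not in seen:
            seen.add(root)
            queue.append(root)
    while queue:
        node = queue.popleft()
        yield node
        for nxt in getattr(node, "next_nodes", ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
```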
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118252
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234, #118249
We are trying to adapt `SparsePrivateUse1` in our code. However, I found that `sparse_stub` has not been exposed yet, which makes it impossible for me to implement the stub and register it. I hope that the header files in this directory can be exposed. @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118274
Approved by: https://github.com/ezyang
Summary: Now that set_ is marked as a view op, this special case is no longer necessary
Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.
Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.
- Before this PR: 9.475 s (8 threads)
- After this PR: 1.632 s (8 threads)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling an expression:
`x*abs(y)` where `y = -2`
What happens is that this expression is passed as `s1*abs(s0)`, then s0 is replaced with ks0 by a call to sympy_subs.
But sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
resulting in `x*abs(ks0) = x*ks0`, which is wrong (see the sketch after this list).
2. Rename sympy_symbol to sympy_index_symbol to make it explicit.
3. Add an assertion that the replaced expression is not passed as a string but is always a sympy expression.
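A standalone sympy sketch of the failure mode in item 1 (symbol names mirror the description; this is not the inductor code itself):
```python
import sympy

x = sympy.Symbol("x")
s0 = sympy.Symbol("s0")                                          # no assumptions, like the original s0
ks0_wrong = sympy.Symbol("ks0", integer=True, nonnegative=True)  # assumptions the old sympy_subs attached
ks0_right = sympy.Symbol("ks0")                                  # assumptions preserved from s0 (i.e. none)

expr = x * sympy.Abs(s0)
print(expr.subs(s0, ks0_wrong))  # x*ks0 -- Abs silently dropped, wrong for a value like -2
print(expr.subs(s0, ks0_right))  # x*Abs(ks0) -- the intended result
```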
Fixes https://github.com/pytorch/pytorch/issues/117757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.
This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
Summary:
Test Plan:
```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.
lintrunner -a
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
Summary:
The class FQN is needed when unpacking a CustomObj instance.
For all other Arguments, e.g. Tensor, TensorList, SymInt, we always know their exact type. However, CustomObjArgument had an opaque type.
Adding this field also helps unveil the type of this opaque object.
Test Plan: CI
Differential Revision: D53029847
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118158
Approved by: https://github.com/zhxchen17
Was debugging an export issue, and currently when `key` does not exist in `self.items`, the error message is
```
File "/opt/pytorch/torch/_dynamo/variables/dicts.py", line 208, in getitem_const
return self.items[key]
~~~~~~~~~~^^^^^
torch._dynamo.exc.InternalTorchDynamoError: <torch._dynamo.variables.dicts.ConstDictVariable._HashableTracker object at 0x7fd7697cbf90>
```
This PR changes it to be the following.
```
File "/data/users/angelayi/pytorch/torch/_dynamo/variables/dicts.py", line 199, in getitem_const
raise KeyError(arg.value)
torch._dynamo.exc.InternalTorchDynamoError: shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117902
Approved by: https://github.com/williamwen42
Currently, we create new_group for the sub_group pg during mesh initialization. The PR changes this so we will:
1) re-use the sub_group pg if it exists,
2) create a new sub_group pg if it does not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (see the sketch after this list).
- We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
- We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
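A minimal sketch of the dynamic class swap described above (`FSDP` and `fully_shard_sketch` here are simplified stand-ins, not the real implementation):
```python
import torch
import torch.nn as nn

class FSDP:
    """Stand-in base class; the real FSDP class adds distributed methods/state."""
    def fsdp_method(self):
        return f"managed {type(self).__name__}"

def fully_shard_sketch(module: nn.Module) -> nn.Module:
    # Dynamic class swap: the new class subclasses both FSDP and the module's class.
    new_cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = new_cls
    return module

lin = fully_shard_sketch(nn.Linear(4, 4))
print(type(lin).__name__)            # FSDPLinear
print(lin.fsdp_method())             # method added via the swap
print(lin(torch.randn(2, 4)).shape)  # original Linear forward still works
```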
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117776
Approved by: https://github.com/wconstab, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117994, #118186, #117984
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. No need for the partial function to be invoked many times.
Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with:
> To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent.
Calling `local_tensor.detach()` will have the returned `Tensor` have `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`.
However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case.
This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`.
We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #117994
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
resume_in_* code objects show up in user backtraces when failures occur
in code that has been Dynamo processed. It is obvious to me, a PT2
developer, that these are generated by PT2, but it is NOT obvious to a
non-core dev that this is what happened. Add an extra torch_dynamo
breadcrumb to help get people to the right place.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118201
Approved by: https://github.com/albanD
Summary: When there were optionals with specified default values, the code improperly handled the number of parameters, causing `IndexError: tuple index out of range`
Test Plan: new tests
Differential Revision: D52977644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118035
Approved by: https://github.com/williamwen42
This test isn't run in CI because the CI runners don't have dill installed.
This fixes the tests so they run for me locally, and in the next PR I add
dill to the CI so we can test it properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116230
Approved by: https://github.com/jansel
We split install_global_once into two APIs:
- `install_global_by_id(prefix, value) -> name`: installs a global if it hasn't
been installed yet
- `install_global(prefix, value) -> name`: always installs the global (and
generates a unique name for it)
Then, we refactor most callsites of `install_global_unsafe` to one of
the previous. Some callsites cannot be refactored because we create the
global name first, do a lot of stuff with it, and then install it.
This fixes more test flakiness.
Test Plan:
- Existing tests; I can't reliably repro the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118100
Approved by: https://github.com/ezyang, https://github.com/mlazos
# Summary
Simplification of Backend Selection
This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API for selecting backends.
With `sdp_kernel`, one would specify the backend choice by taking the negation of the kernels they would like to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.
Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager, then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cuDNN backend, and this API makes it so that more implementations will need to be turned off if a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager is in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.
Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?
A nice side effect: now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for CPU tests.
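A minimal usage sketch of the new context manager (shapes, dtype and device are illustrative and assume a GPU that supports flash attention):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Opt in to the backend you want to run, rather than turning off the ones you don't.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```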
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
This PR does what it says and more.
1. We increase coverage by a LOT! Previously, complex was not tested for many many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly; I added a comment in the Files Changed. This new order also just... makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
The return type for the forward pass of nn.AdaptiveMaxPool1d is specified to be Tensor, but if self.return_indices, then the result type should be tuple[Tensor,Tensor].
For users trying to trace/script this function with indices, the incorrect typing is problematic.
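For illustration, the `return_indices=True` path returns a tuple rather than a single Tensor:
```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool1d(4, return_indices=True)
out, idx = pool(torch.randn(1, 3, 16))   # a (values, indices) tuple, not a single Tensor
print(out.shape, idx.shape)              # torch.Size([1, 3, 4]) torch.Size([1, 3, 4])
```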
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118162
Approved by: https://github.com/albanD
This PR rewrites the sharded embedding rule to use OpStrategy instead of the
rule, one step further toward getting rid of rules and consolidating the embedding
operator implementation, to prepare for the rowwise embedding
implementation, which will come in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
Summary:
We observed the following error when launching the e2e AFOC model test
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
f524190245
Differential Revision: D53011463
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118105
Approved by: https://github.com/jackiexu1992
On Linux and Mac `int64_t` is an alias to either `long` (Linux) or `long long` (Mac)
Because of that, an attempt to construct `c10::Scalar` from the other type will fail with `conversion from ‘long long int’ to ‘c10::Scalar’ is ambiguous`.
I.e., an attempt to compile:
```cpp
int main() {
c10::Scalar s = 1L;
}
```
on MacOS failed with:
```
foo.cpp:3:15: error: conversion from 'long' to 'c10::Scalar' is ambiguous
c10::Scalar s = 1L;
^ ~~
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
DEFINE_IMPLICIT_CTOR)
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:62:3: note: candidate constructor
Scalar(uint16_t vv) : Scalar(vv, true) {}
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:63:3: note: candidate constructor
Scalar(uint32_t vv) : Scalar(vv, true) {}
^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:64:3: note: candidate constructor
Scalar(uint64_t vv) {
^
```
Prevent this by providing the missing constructors when needed. Alas, one cannot use SFINAE, as template constructors on Scalar mess up a lot of implicit conversions, so I use `static_asserts` to detect early on whether the premise for constructing this class holds.
Add ScalarTest::LongsAndLongLongs that is essentially a compile time test
Discovered while trying to enable AOTI on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118149
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #118077, #118076
- Add `darwin` to the list of supported platform
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
- Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
By not passing linker flag if `compile_only` is set to `True`
Before that change every invocation of AOTI compiler resulted in emitting at least 4 warnings:
```
clang: warning: -lomp: 'linker' input unused [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-undefined dynamic_lookup' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-L/Users/nshulga/miniforge3/lib' [-Wunused-command-line-argument]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118077
Approved by: https://github.com/desertfire
Summary:
This diff introduces a caching mechanism to improve the performance of the partitioner in PyTorch. The changes involve adding a cache to store the DFS path of each node in the graph, which can be reused later when trying to find cycles in the graph.
This shows significant improvements for the edge use cases where the ASR model (which is around 6000+ nodes) used to take 26 minutes, but after this it takes around 8 minutes.
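A standalone sketch of the caching idea (not the actual partitioner code; the graph representation and names are made up): memoize per-node reachability so repeated cycle checks do not redo the full DFS.
```python
from functools import lru_cache

def make_reachability(graph):
    """graph: dict mapping node -> list of successor nodes (assumed acyclic)."""
    @lru_cache(maxsize=None)
    def reachable_from(node):
        out = set()
        for nxt in graph.get(node, ()):
            out.add(nxt)
            out |= reachable_from(nxt)
        return frozenset(out)
    return reachable_from

graph = {"a": ["b"], "b": ["c"], "c": []}
reachable_from = make_reachability(graph)
# Fusing c into a (or adding an edge c -> a) would create a cycle
# iff c is already reachable from a.
print("c" in reachable_from("a"))  # True -> reject
```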
Test Plan: Relying on the existing ExecuTorch CI tests that heavily use this partitioning mechanism and also tested out locally via Bento notebooks.
Differential Revision: D51289200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115943
Approved by: https://github.com/SherlockNoMad
**Summary**
Previously the DTensor sharding plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a dimension of size 0. This filter should return `True` if the sharding placement on that 0-sized dimension is `Replicate`, even if `tensor dim < num of shards` on that dimension, in which case `tensor dim == 0` and `num of shards == 1`.
In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131
**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
Summary: as title, the "canonical" flag is added to the sigmoid serializer so that we can optionally "normalize" the IR to give stable names and orders to IR nodes, which could help in cases where we need to compare IR definitions.
Test Plan: buck run @//mode/opt //aps_models/ads/config_model_authoring/stability:cli export-generated-module-state-command
Differential Revision: D52431965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116758
Approved by: https://github.com/avikchaudhuri
## Context
This is an example that runs into an AssertionError while lowering in Inductor.
```
# While lowering, b will be expanded because b.size(1) == 1.
a = torch.zeros([u0, 512])
b = torch.ones([u0, 1])
return a * b
```
Below is the tail end of the stack trace. Here are the important bits:
1. In _inductor/sizevars.py, we'll call `self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)`.
2. This leads to the creation of a `ShapeEnvEvent` with an FX node via `kwargs={"fx_node": V.graph.current_node}` ([see](0c9b513470/torch/fx/experimental/recording.py (L245-L247))).
3. Eventually, we try to call `maybe_convert_node()` but it expects translation validation to be on ([see](0c9b513470/torch/fx/experimental/recording.py (L118-L121))).
```
File "pytorch/torch/_inductor/lowering.py", line 221, in transform_args
for i, x in zip(indices, broadcast_tensors(*[args[i] for i in indices])):
File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
out = decomp_fn(*args, **kwargs)
File "pytorch/torch/_inductor/lowering.py", line 676, in broadcast_tensors
x = expand(x, target)
File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
out = decomp_fn(*args, **kwargs)
File "pytorch/torch/_inductor/lowering.py", line 793, in expand
return TensorBox(ExpandView.create(x.data, tuple(sizes)))
File "pytorch/torch/_inductor/ir.py", line 1871, in create
new_size = cls._normalize_size(x, new_size)
File "pytorch/torch/_inductor/ir.py", line 1862, in _normalize_size
new_size[i] = V.graph.sizevars.expect_equals(
File "pytorch/torch/_inductor/sizevars.py", line 338, in expect_equals
self.expect_true(sympy.Eq(left, right), msg=msg)
File "pytorch/torch/_inductor/sizevars.py", line 333, in expect_true
self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node) # (1) is here
File "pytorch/torch/fx/experimental/recording.py", line 257, in wrapper
return event.run(self) # (2) happens right before this
File "pytorch/torch/fx/experimental/recording.py", line 155, in run
replacearg(index=3, key="fx_node", fn=maybe_convert_node)
File "pytorch/torch/fx/experimental/recording.py", line 138, in replacearg
kwargs[key] = fn(kwargs[key])
File "pytorch/torch/fx/experimental/recording.py", line 128, in maybe_convert_node
assert hasattr(shape_env, "name_to_node") # (3) is here
```
## Approach
Since [translation validation](c6be5d55a5/torch/fx/experimental/validator.py (L574)) may not be on during Inductor lowering, we can check if that's True and return the FX node's name in this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118066
Approved by: https://github.com/ezyang, https://github.com/peterbell10
This PR intends to fix the following issue when swapping two tensors
```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581, 0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581, 0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```
What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.
57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)
When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.
The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.
**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.
What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups --> which all just distinguish between lrs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
Summary: Adding an experimental API to FX graph module to place "hooks" every time we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.
Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform
buck test mode/opt caffe2/test:test_export -- -r test_replace_hook
Differential Revision: D52896531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.
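For reference, a minimal sketch of the replacement API (module and dim names are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")
# dynamic_shapes replaces the deprecated constraints=[...] argument:
# map each input name to the dims that should stay symbolic.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep)
```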
Test Plan: CI
Differential Revision: D52977354
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, such as when empty_cache() may cause a segment to be freed. If the free context is unavailable, resort to the segment allocation stack.
Test Plan: CI
Differential Revision: D52984953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers (#117767). If interested, see the (internal only) pastes for [before](P996529232) and [after](P997507449) this PR.
```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```
2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)
<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).
There is no Python refcycle, as the backrefs for `p_ref()` look like:
[backrefs graph screenshot in the PR]
(so 5 backrefs but none of them python)
And the refs:
[refs graph screenshot in the PR]
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
Fix https://github.com/pytorch/pytorch/issues/97352.
This PR changes the way the linking to intel MKL is done and updating MKL on Windows to mkl-2021.4.0 .
Both conda and pip provide MKL packages against which you can link dynamically: mkl-devel contains the static versions of the libraries, and MKL contains the DLLs needed at runtime. MKL DLLs and static libs starting with 2021.4.0 have the version in their names (for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll), so it is possible to have multiple versions installed and it will work properly.
For the wheel build I added a dependency on the MKL wheel, for conda a dependency on the conda MKL package, and for libtorch I copied the MKL binaries into libtorch.
In order to test this PR I had to use a custom builder: https://github.com/pytorch/builder/pull/1467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
This expands the reinplacing pass to allow reinplacing view-scatter operations.
e.g. if our python code is:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
this generates a functionalized graph like:
```python
a = view1(inp)
a_updated = view2_scatter(a, src)
inp_updated = view1_scatter(inp, a_updated)
```
First, the `canonicalize_view_scatter_ops` step rewrites the functionalized graph
in the form:
```python
inp_updated = _generalized_scatter(inp, src, [view1, view2])
a_updated = view1(inp_updated)
```
I then register `_generalized_scatter` as a normal inplacable op which can be
handled by the pre-existing mechanism. Since we've fused the two scatter ops into one,
the reinplacing pass sees only one user of `inp` which allows the entire operation to be
reinplaced if desired (and I add heuristics that sometimes choose not to reinplace).
Finally, there is a decomposition step which decomposes out-of-place or in-place
`_generalized_scatter` operations either back into view_scatter operations, or
into the version with mutations. When introducing mutations, the reinplaced
version is equivalent to the original mutation:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
Or when out-of-place we end up with a minor restructuring of the graph:
```
a = view1(inp)
tmp = view2_scatter(a, src)
inp_updated = view1_scatter(inp, tmp)
a_updated = view1(inp_updated)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116899
Approved by: https://github.com/lezcano
ghstack dependencies: #116898, #117121
Currently if you have the code:
```python
idx = torch.arange(10, device=x.device)
src = torch.ones(10, dtype=x.dtype, device=x.device)
x.index_put_((idx,), src)
expand = x.expand((2, x.shape[0]))
```
The `index_put_` cannot be reinplaced under dynamic shapes due to the user
`aten.sym_size(x, 0)`; however, since this function only looks at the tensor
metadata, it is actually fine to reinplace.
Here I ignore these operators in the analysis of the reinplacing pass, so
reinplacing can happen under dynamic shapes as well. I also handle cases
where views are created just to be fed to `sym_size`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117121
Approved by: https://github.com/lezcano
ghstack dependencies: #116898
Previously, if someone wrote a python abstract impl but didn't import
the module it is in, then we would raise an error message suggesting
that the user needs to add an abstract impl for the operator.
In addition to this, we suggest that the user try importing the module
associated with the operator in the pystub (it's not guaranteed that
an abstract impl does exist) to avoid confusion.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117770
Approved by: https://github.com/ydwu4, https://github.com/williamwen42
At least if one tries to compile the AOTI code on Darwin, compilation
fails with an implicit instantiation of undefined template error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
Memory leak detected on ROCm. Skip until it can be addressed.
PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test_eager_transforms.py -k test_compile_vmap_hessian_cuda
See #117642 for moving rocm CI to unstable due to this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118009
Approved by: https://github.com/jeanschmidt
Summary:
Previously, heartbeat was incremented once per finishing a for loop over a list
of in-progress work items, under the assumption that either the processing
would be predictably quick, or it would hang completely.
In fact, there can be cuda API contention that causes the processing of works
to slow down arbitrarily but not truly deadlock. To guard against this, we
bump the heartbeat at the smallest unit of progress, one work item being
successfully processed.
Test Plan: CI
Differential Revision: D52973948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
XLA CI is currently broken in PyTorch; I think there are 2 reasons causing that:
1. There is an offending PyTorch PR c393b2f1ee. Han is working on a fix in https://github.com/pytorch/xla/pull/6345
2. The commit that pytorch pins to, 2990cb38c17e06d0dbe25437674ca40130d76a8f, is not a valid commit. I think this is because we tried to help them land a breaking PR in https://github.com/pytorch/xla/pull/6307. However, I think we did a rebase which made that commit vanish, so now the CI fails with
```
fatal: reference is not a tree: 2990cb38c17e06d0dbe25437674ca40130d76a8f
585
```
Let me first update the pin to master so it at least runs some tests; this way we can discover if there are any additional issues. I will rebase after @qihqi's fix passes all CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117999
Approved by: https://github.com/clee2000
Fixes https://github.com/pytorch/pytorch/issues/117851
In tests, we ran into an issue where:
- In frame A, Dynamo would install a global
- We call reset()
- reset() did not delete the installed global due to a refcycle
- In frame B, Dynamo would re-use the same global
- Python gc ran, deleting the installed global, leading to the compiled
version of frame B raising NameNotFound
This PR changes the following:
- module globals are now installed at a per-frame basis.
- renames install_global to install_global_unsafe: if the names are not
unique and end up being re-used across frames, then we've got trouble.
Test Plan:
- I tested that this got rid of the test flakiness locally. I'm not sure
how to easily write a test for this, because I don't actually know
what the refcycle in the above is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117998
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Currently, when the user passes a model state_dict which is not a file,
ONNXProgram.save calls torch.save along with io.BytesIO, which does not
support memory-mapping. That makes the file stream be fully allocated in
memory.
This PR removes the torch.save call and passes the dict directly to the
serializer. This is beneficial for the scenario where model_state_dict
is generated by torch.load(..., mmap=True), as the state dict will be
mapped in memory instead of fully loaded in memory.
This PR leverages https://github.com/pytorch/pytorch/pull/102549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117863
Approved by: https://github.com/wschin
Today, our param_group testing does the equivalent of pitting weight and bias against each other with different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.
This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical, as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensured that the param in the default group doesn't move while the loss does (see the sketch after this section).
Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.
The leftover param group configs are used in conjunction with LRSchedulers.
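A standalone sketch of test 2 under those assumptions (SGD stands in for an arbitrary optimizer; the real test iterates over all optimizer inputs):
```python
import torch

w_default = torch.nn.Parameter(torch.randn(3))
w_tuned = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.SGD(
    [{"params": [w_default]}, {"params": [w_tuned], "lr": 0.5}],
    lr=1e-12,  # default group lr is "basically 0"
)
before = w_default.detach().clone()
(w_default.sum() + w_tuned.sum()).backward()
opt.step()
assert torch.allclose(w_default, before)  # param in the default group did not move
```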
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
Summary:
When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter - but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.
Fixes#117273
Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
Summary:
Our upcoming compiler upgrade will require us not to have shadowed variables. Such variables have a _high_ bug rate and reduce readability, so we would like to avoid them even if the compiler were not forcing us to do so.
This codemod attempts to fix an instance of a shadowed variable. Please review with care: if it has failed, the result will be a silent bug.
**What's a shadowed variable?**
Shadowed variables are variables in an inner scope with the same name as another variable in an outer scope. Having the same name for both variables might be semantically correct, but it can make the code confusing to read! It can also hide subtle bugs.
This diff fixes such an issue by renaming the variable.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: igorsugak
Differential Revision: D52582853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117996
Approved by: https://github.com/PaliC, https://github.com/kit1980, https://github.com/malfet
# Context
Let's say we do `View.create(x, sizes)` where `x` is a `SliceView` and `sizes` contains unbacked symints, e.g. `sizes = [i14, 256]`. Then we'll run ([this code](7e37f63e5e/torch/_inductor/ir.py (L2058-L2071))) where we:
1. Call `x.realize()` -- SliceView(Pointwise) -> SliceView(ComputedBuffer).
2. Retrieve storage & layout via `as_storage_and_layout(x)`
3. Calculate `new_layout` based on the layout & `new_sizes`
4. `return ReinterpretView(storage, new_layout)`
However, (2) will raise `NotImplementedError` ([see](7e37f63e5e/torch/_inductor/ir.py (L1704-L1731))) since `x` is a `SliceView` and that isn't supported.
Thus, I tried adding support for `SliceView` in `as_storage_and_layout`. This worked for my case, but if instead `sizes` had backed symints e.g. `sizes=[s0, 256]` then some benchmarked models lost accuracy.
```
if isinstance(x, SliceView):
    return as_storage_and_layout(
        x.data,
        freeze=freeze,
        want_contiguous=want_contiguous,
        stride_order=stride_order,
    )
```
So instead of the above, I tried unwrapping the `SliceView` via `x = x.unwrap_view()`. This works for my use case and passes CI, but I'm not entirely sure why. If we unwrap our `SliceView` and create a `ReinterpretView`, I'd assume we'd lose the reindexer from `SliceView`. ~~But maybe we can re-create the same indexing from the `ReinterpretView`'s strides?~~ edit: we do lose vital information (like the offset) when we discard our `SliceView` and create a `ReinterpretView`, so that's a no-go.
Moving onto the final version of this PR. We call `ExternKernel.realize_input()` (feels a bit weird to use `ExternKernel` but it's exactly what I need). It will go ahead and handle our `SliceView` case ([see](a468b9fbdf/torch/_inductor/ir.py (L3733-L3739))) by converting it to a `ReinterpretView` with the correct offset.
# Test
```
$ python test/inductor/test_unbacked_symints.py
..
----------------------------------------------------------------------
Ran 10 tests in 20.813s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117013
Approved by: https://github.com/jansel, https://github.com/ezyang
Summary:
In `torch.export.export(f, args, kwargs, ..., dynamic_shapes=None, ...)`, `dataclass` is an acceptable type of inputs (for args and kwargs). The `dynamic_shapes` of the `dataclass` inputs needs to be the same `dataclass` type, with each tensor attribute replaced by the `dynamic_shapes` of the corresponding tensor. (https://github.com/pytorch/pytorch/blob/main/torch/export/dynamic_shapes.py#L375)
However, some `dataclass` may have limitations on the types of attributes (e.g., having to be tensors), such that the same `dataclass` cannot be constructed for dynamic shapes.
For an input of `dataclass` type, this change enables passing `dynamic_shapes` as a tuple that specifies the dynamic shape specification for each tensor of the input, in the same order as the input dataclass type's flatten_fn (https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py#L103)
Test Plan: buck test //caffe2/test:test_export
Differential Revision: D52932856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117917
Approved by: https://github.com/avikchaudhuri
Summary: The `FxNetAccFusionsFinder.recursive_add_node` function can run into exponential complexity when applied to an fx graph with multiple densely connected layers of nodes. Here we add a `visited` set which reduces the worst-case complexity to linear.
In the internal MRS models with the densely connected layer structure, this fix reduces the fx split time from forever to < 100ms, hence unblocking the internal enablement.
P.S. As much as I want to add a unit test, I can't find any existing tests for the `_SplitterBase` infra. Happy to add one if pointed to where. Thanks!
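A generic sketch of the fix (not the actual `FxNetAccFusionsFinder` code): memoizing visited nodes turns the repeated re-traversal of densely connected fx nodes into a single linear pass.
```python
def collect_reachable(node, visited=None):
    # `node` is assumed to be an fx.Node-like object exposing `.users`.
    if visited is None:
        visited = set()
    if node in visited:      # early-out that bounds the recursion
        return visited
    visited.add(node)
    for user in node.users:
        collect_reachable(user, visited)
    return visited
```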
Test Plan: CI
Differential Revision: [D52951321](https://our.internmc.facebook.com/intern/diff/D52951321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117969
Approved by: https://github.com/oulgen, https://github.com/khabinov
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.
**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.
Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor
Differential Revision: D52931743
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
Mark `dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes` explicitly as slow
I cannot figure out the correct way to do this
Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117896
Approved by: https://github.com/huydhn
This PR:
- refactors the redistribute implementation logic to make it more sound, by figuring out the transform information first and then applying the transformations step by step; we also cache the decisions so that they can be reused
- for uneven sharding, refactors the uneven sharding logic and uses a logical shape concept for each transform information to fix the uneven sharding multi-mesh redistribute bug
fixes https://github.com/pytorch/pytorch/issues/115310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525
Approved by: https://github.com/XilunWu
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.
I hope I did the RST right, I haven't done RST in a while.
- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.
tagging @albanD as requested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
This PR changes torch.export to require an nn.Module as input, rather than taking an arbitrary callable.
The rationale for this is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function:
1. We "guarantee" that every call_function node has an `nn_module_stack` populated.
2. We offer ways to access the state_dict/parameters/buffers of the exported program.
We'd like torch.export to offer strong invariants—the value proposition of export is that you can trade flexibility for stronger guarantees about your model.
An alternative design would be to implicitly convert the top-level function into a module, rather than require that the user provide a module. I think that's reasonable (it's what we did in TorchScript), but in the spirit of being explicit (another design tenet of export) I avoid that here.
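A small sketch of the migration this implies for callers: a bare function can no longer be passed directly, but wrapping it in an `nn.Module` is mechanical.
```python
import torch

def f(x):
    return x.sin() + x.cos()

class FWrapper(torch.nn.Module):
    def forward(self, x):
        return f(x)

# torch.export.export(f, (torch.randn(3),)) would now be rejected;
# export the wrapping module instead.
ep = torch.export.export(FWrapper(), (torch.randn(3),))
```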
Differential Revision: [D52789321](https://our.internmc.facebook.com/intern/diff/D52789321/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117528
Approved by: https://github.com/thiagocrepaldi, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
For large complex values the division produces inf or NaN values, which leads other functions (e.g. `torch._refs.sgn`, used in a test) to produce them too.
Example:
```
$ python -c 'import torch; print(torch._refs.sgn(torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32))))'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(t / t.abs())'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
```
Implement the same algorithm as used in numpy and x86 (#93277).
The reason is that for a tensor with a component of `1e20`, the abs-squared value used in the division contains a term `1e20 * 1e20`, which overflows the dynamic range of float32 (~3e38) and yields an "inf", so the division yields "nan". A small sketch of the scaling trick follows the output below.
Output after change:
```
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(torch._refs.sgn(t), t.sgn(), t / t.abs())'
tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j])
```
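A standalone sketch of the scaling trick (mirroring the numpy/x86 idea referenced above, not the exact kernel code): divide both components by the larger magnitude first so the squared terms cannot overflow float32, then normalize.
```python
import torch

def safe_sgn(z: torch.Tensor) -> torch.Tensor:
    a, b = z.real, z.imag
    m = torch.maximum(a.abs(), b.abs())
    m = torch.where(m == 0, torch.ones_like(m), m)   # avoid 0/0 for z == 0
    a_s, b_s = a / m, b / m
    denom = torch.sqrt(a_s * a_s + b_s * b_s)        # cannot overflow anymore
    out = torch.complex(a_s / denom, b_s / denom)
    return torch.where(z == 0, torch.zeros_like(out), out)

t = torch.complex(torch.full((4,), -501.0), torch.full((4,), -1e20))
print(safe_sgn(t))   # ~ -5.0100e-18-1.j instead of -0.+nanj
```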
CC @quickwritereader who wrote the initial code and @VitalyFedyunin who was involved in the initial review and @lezcano who reviewed #93277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116972
Approved by: https://github.com/lezcano
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.
Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper
Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.
I hope I did the RST right, I haven't done RST in a while.
- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.
tagging @albanD as requested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
...: pass
...:
In [2]: def call1(*args, **kwargs):
...: return null(*args, **(kwargs or {}))
...:
In [3]: def call2(*args, **kwargs):
...: return null(*args, **kwargs)
...:
In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
Summary:
`conv1d` has two arguments, `weight` and `bias`, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this operator to avoid this repeated transfer. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`
Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.conv1d_simple
[ OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN ] VulkanAPITest.conv1d
[ OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[ PASSED ] 2 tests.
```
Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[ PASSED ] 418 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
import torch
import torch.nn as nn

# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer
vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
%x : Tensor):
%60 : Device = prim::Constant[value="cpu"]()
%self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%37 : bool = prim::Constant[value=0]()
%36 : NoneType = prim::Constant()
%59 : Device = prim::Constant[value="vulkan"]()
%self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
%18 : int[] = prim::Constant[value=[1]]()
%19 : int[] = prim::Constant[value=[0]]()
%39 : Tensor = aten::to(%x, %59, %36, %37, %37)
%20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
%58 : Tensor = aten::to(%20, %60, %36, %37, %37)
return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
%x : Tensor):
%85 : Device = prim::Constant[value="cpu"]()
%51 : bool = prim::Constant[value=0]()
%50 : NoneType = prim::Constant()
%84 : Device = prim::Constant[value="vulkan"]()
%53 : Tensor = aten::to(%x, %84, %50, %51, %51)
%prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
%22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
%83 : Tensor = aten::to(%22, %85, %50, %51, %51)
return (%83)
```
Differential Revision: D52865379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
torch.nn.Linear crashes with an internal assert if invoked with 5D tensors, due to a bug in the MPS framework, i.e. invoking
```swift
import MetalPerformanceShadersGraph
let graph = MPSGraph()
let x = graph.constant(1, shape: [2, 1, 2, 1, 2], dataType: .float32)
let y = graph.constant(1, shape: [2, 3], dataType: .float32)
let z = graph.matrixMultiplication(primary: x, secondary: y, name: nil)
let device = MTLCreateSystemDefaultDevice()!
let buf = device.makeBuffer(length: 48)!
let td = MPSGraphTensorData(buf, shape: [2, 1, 2, 1, 3], dataType: .int32)
let cmdBuf = MPSCommandBuffer(from: device.makeCommandQueue()!)
graph.encode(to: cmdBuf, feeds: [:], targetOperations: nil, resultsDictionary: [z:td], executionDescriptor: nil)
cmdBuf.commit()
```
crashes with
```
AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayIdentity.mm:813: failed assertion `New volume: 4 should match old volume: 8 [reshapeWithCommandBuffer] MPSNDArrayIdentity.'
zsh: abort ./build/matmul
```
Work around the issue by flattening the forward and backward tensors if the number of dimensions is greater than 4 (a rough illustration of the idea follows below).
Add regression tests to the Linear OpInfo samples.
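A rough Python illustration of the flattening idea (the actual fix lives in the MPS backend, not in Python): collapse the extra leading dimensions so the underlying matmul only ever sees at most a 3-D problem, then restore the original shape.
```python
import torch

def linear_nd(x, weight, bias=None):
    orig_shape = x.shape
    if x.dim() > 4:
        # fold all leading batch dims into one
        x = x.reshape(-1, orig_shape[-2], orig_shape[-1])
    out = torch.nn.functional.linear(x, weight, bias)
    return out.reshape(*orig_shape[:-1], weight.shape[0])
```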
Fixes https://github.com/pytorch/pytorch/issues/114942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117837
Approved by: https://github.com/janeyx99
Using ONNX opset 14, the aten scaled_dot_product_attention operator can be implemented with bfloat16 support because Add-14 supports bfloat16.
This PR simply adds bfloat16 to the list of supported types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117878
Approved by: https://github.com/BowenBao
Summary:
Following the implementation of Softmax, we stride over the texture differently based on the desired dimension.
Softmax performs a similar operation to cumsum (generally called a "scan"), iterating over all items in a dimension, but cumsum only needs to iterate once to collate the sum, whereas softmax needs to iterate multiple times to collect the max and denominator for the final calculation.
Similar to the softmax implementation, there are likely opportunities to optimize, but this gets all dims < 4 functional first.
Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cumsum*"`:
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cumsum*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.cumsum_1d
[ OK ] VulkanAPITest.cumsum_1d (93 ms)
[ RUN ] VulkanAPITest.cumsum_2d
[ OK ] VulkanAPITest.cumsum_2d (74 ms)
[ RUN ] VulkanAPITest.cumsum_3d
[ OK ] VulkanAPITest.cumsum_3d (105 ms)
[ RUN ] VulkanAPITest.cumsum_4d
[ OK ] VulkanAPITest.cumsum_4d (73 ms)
[----------] 4 tests from VulkanAPITest (346 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (346 ms total)
[ PASSED ] 4 tests.
```
Differential Revision: D52814000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117580
Approved by: https://github.com/yipjustin
Summary: The added test case ends up emitting an inductor IR node as the buffer string; let's properly emit the buffer name instead.
Test Plan: added new test
Differential Revision: D52899373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries
Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree flattened into supported types (tensors, symints, ...). So if the user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
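For reference, a hedged sketch of what the new error message asks users to do, for a hypothetical container type that isn't pytree flatten-able out of the box (the exact registration entry point may differ across versions):
```python
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, first, second):
        self.first = first
        self.second = second

pytree.register_pytree_node(
    Pair,
    lambda p: ([p.first, p.second], None),    # flatten: children + context
    lambda children, ctx: Pair(*children),    # unflatten
)
```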
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
As usual, almost no work on PyTorch side, all changes are on the builder end, namely:
- 8b67d32929 - depend on `blas * mkl` only on x86 machines
- eb78393f1e - install arm64 conda when running on Apple Silicon
- 0d3aea4ee0 - constrain llvmdev-9 to x86 machines only
- 6c6a33b271 - set correct DEVELOPER_DIR path
TODO:
- We should auto-detect this `DEVELOPER_DIR` via `xcode-select`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117801
Approved by: https://github.com/atalman
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.
It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.
Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.
Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support
Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.
Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
Because ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.
Take DDP and Ampere for example:
DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s
25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.
Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes before making the checking even more aggressive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
Fixes #117660
(1) Skip dynamic tests for exported program in `test_fx_to_onnx_onnxruntime.py`, as they are not expected to pass anyway.
(2) Move the dolly model to runtime, since it works in exporting, but it is blocked by non-persistent buffers as well.
(3) openai whisper has changed/regressed due to modeling modifications.
(4) Replace OpenLlama with Llama, because OpenLlama is deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117703
Approved by: https://github.com/thiagocrepaldi
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time:
- persistent reduction
- regular reduction
A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime (a simplified sketch of the dispatch idea is below).
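A heavily simplified sketch of the dispatch idea (not Inductor's actual MultiKernel codegen): benchmark both generated kernels on the first call, then stick with the winner.
```python
import timeit

class MultiKernelSketch:
    def __init__(self, persistent_kernel, regular_kernel):
        self.kernels = [persistent_kernel, regular_kernel]
        self.choice = None

    def __call__(self, *args):
        if self.choice is None:
            # one-time benchmarking pass over both flavors
            timings = [
                timeit.timeit(lambda k=k: k(*args), number=10)
                for k in self.kernels
            ]
            self.choice = min(range(len(timings)), key=timings.__getitem__)
        return self.kernels[self.choice](*args)
```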
Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.
***There is one thing I need some comments from others***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have a consistent argument list. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants. But this would make the code-gen for multi-kernel much more complex.
I'm not sure if there is some easy and clean way to resolve this.
Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
Fixes #117685.
This PR only makes ConstantSource preserved for built-in ops when we find that all the inputs are either constant tensors or python constants.
It doesn't fundamentally solve the problem of preserving ConstantSource information through all operators that could potentially be constant folded.
For the following code in the issue:
```
class Bob(torch.nn.Module):
    def __init__(self, p, val) -> None:
        super().__init__()
        self.p = p
        self.y = torch.nn.Parameter(torch.tensor(val))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This only looks dynamic but it's actually a constant value
        if get_y(self.y) < self.p:
            return torch.cat([x,x])
        else:
            return x
```
The graph exported looks like following:
```python
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f32[s0, s1]";
        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        l_x_ = arg0
        # File: /home/yidi/local/pytorch/test/dynamo/test_export.py:1498 in forward, code: return torch.cat([x, x])
        cat = torch.cat([l_x_, l_x_]); l_x_ = None
        return pytree.tree_unflatten([cat], self._out_spec)
```
Test Plan:
Added a new test for the given repro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117704
Approved by: https://github.com/jansel, https://github.com/anijain2305
enable this test in meta-internal CI, since it's mildly infuriating to not be able to locally test this when working inside meta
One change:
This test uses `pkgutil.walk_packages`, which ignores namespace packages. A quirk in Meta's internal python packaging system is that it adds `__init__.py` to each source directory. So this test picks up more files to check internally than in the GitHub CI.
So I changed this test from using raw `pkgutil` to a version that also looks into namespace packages, so we're checking the same thing across both CIs.
Differential Revision: [D52857631](https://our.internmc.facebook.com/intern/diff/D52857631/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117712
Approved by: https://github.com/ezyang
Summary:
By default, in LLD 16, .lrodata is placed immediately after .rodata.
However, .lrodata can be very large in our compiled models, which leads to
relocation out-of-range errors for relative relocations. So we place it
after the other sections that are referenced from .text using relative
relocations. This is the default behavior in GNU ld.
Reviewed By: muchulee8, desertfire, khabinov, chenyang78
Differential Revision: D52557846
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117575
Approved by: https://github.com/chenyang78, https://github.com/khabinov
This PR refactors the distributed-related variables to use DistributedVariable for common methods, so that things like `python_type` work for all distributed variables.
Maybe we can add `as_python_constant` to the DistributedVariable too? I didn't add it in this PR, but if that makes sense I can update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117590
Approved by: https://github.com/voznesenskym
Fixes#115922
This PR is prepared to split off the existing https://github.com/pytorch/pytorch/pull/116926 and to apply suggestions from the review.
`scalar_t`, which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t`, appears to be causing the issue with `Visual Studio 2022 17.8.4` (which comes with `MSVC 14.38.33130`)
Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```
---
Related line was added for a similar issue before as a workaround (`scalar_t` definition) [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
# Context
In some cases, we might want to build the `context_fn` with runtime-defined policies. One way of implementing this is to make `context_fn` be a partial, which holds the information that we want to pass. One concrete example is the [automatic policy selection from `xformers`](ad986981b1/xformers/checkpoint.py (L185)).
# The problem
The previous implementation wouldn't work with partials because `FunctoolsPartialVariable` doesn't have a `fn` attribute.
This PR addresses this case, but ideally we could get this solved in a more general fashion, as callable classes and `NestedUserFunctionVariable` are not supported by this PR.
# Tests
I've added a basic test that mimics the tests around it. The tests could probably be simplified, but I've decided to keep changes to a minimum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117657
Approved by: https://github.com/yf225
Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend, we need to prepare the corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code change.
This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality as follow:
- test/inductor/test_codecache.py
- test/inductor/test_codegen_triton.py
- test/inductor/test_kernel_benchmark.py
- test/inductor/test_torchinductor.py
- test/inductor/test_torchinductor_codegen_dynamic_shapes.py
- test/inductor/test_torchinductor_dynamic_shapes.py
- test/inductor/test_torchinductor_opinfo.py
- test/inductor/test_triton_heuristics.py
- test/inductor/test_triton_wrapper.py
Feature request: https://github.com/pytorch/pytorch/issues/114856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513
Approved by: https://github.com/EikanWang, https://github.com/jansel
Summary: Allow the trainer to explicitly shut down the compile-worker pools to save CPU resources, thereby avoiding QPS degradation.
Test Plan: See the test plan in D52839313
Differential Revision: D52839313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117664
Approved by: https://github.com/yanboliang
pydocstyle check
averagers.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
D400: First line should end with a period (not '`')
6
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
D107: Missing docstring in __init__
4
utils.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`:
D200: One-line docstring should fit on one line with quotes (found 3)
5
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
D100: Missing docstring in public module
1
hierarchical_model_averager.py
Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`:
D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D400: First line should end with a period (not ',')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
D400: First line should end with a period (not '`')
8
Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`:
D107: Missing docstring in __init__
2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038
Approved by: https://github.com/H-Huang
Summary: Recently, we found that merge splits (D45204109) is not working for the AFOC model, so we patch a fix.
Test Plan:
The error log: P1046934021
# Flows used to local reproduce
### non-first:
f522317780
after the fix: P1047603217
### first:
f522253163
after the fix: P1047764917
Differential Revision: D52856359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117707
Approved by: https://github.com/jackiexu1992
Fix for https://github.com/pytorch/pytorch/issues/113895
There are three phases to cudagraph trees: warmup, recording, and execution. On recording and execution we execute under the current_stream. In warmup we execute under a side stream that we also use for cudagraph recording so as to reuse memory.
After we execute on the side stream we need to sync the current stream to the side stream. Previously there was a `torch.cuda.synchronize` but not a `torch.cuda.current_stream().wait_stream(stream)`. This PR removes the global sync and adds a wait_stream. I have confirmed that it fixes https://github.com/pytorch/pytorch/issues/113895.
It's not entirely clear to me why torch.cuda.synchronize would be insufficient - I would have thought the global sync would encompass the stream-to-stream sync. However, we do have a number of [instances](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L748-L749) throughout the code base where we do a stream->stream sync after the global sync, so clearly I am missing something here. In any case, the stream->stream sync is better for perf than a global synchronize.
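A condensed sketch of the ordering this PR enforces (assuming CUDA is available; `warmup_run` is a hypothetical stand-in for running the model on the warmup path):
```python
import torch

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    outputs = warmup_run()  # hypothetical warmup invocation on the side stream
# Replaces the previous torch.cuda.synchronize(): the current stream must wait
# for the side stream before anything consumes `outputs`.
torch.cuda.current_stream().wait_stream(side_stream)
```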
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117578
Approved by: https://github.com/zdevito
Whenever an IR node has reference to an unbacked SymInt, we must
register it as a use of the unbacked SymInt.
This fix isn't complete but the rest of the fix is fairly difficult, so
putting this in to start.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117650
Approved by: https://github.com/lezcano
Work.result() returns a vector of tensors. This signature is problematic, as some collectives may just return one tensor (e.g. all-reduce), while some others may return multiple tensors (e.g. all-gather).
It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.
Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
It should have never been landed, but was landed again, thanks to
ghstack grafting/ungrafting see discussion on https://github.com/pytorch/pytorch/pull/116910
This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
Fixes https://github.com/pytorch/pytorch/issues/114301
Previously, coalesced work (created by `end_coalescing`) was not watched by the watchdog, which resulted in silent timeouts.
The culprit is that we reset `coalescing_state_` to 0 before checking it to see if we should enqueue the work.
Example:
```
import torch
import torch.distributed as dist
from datetime import timedelta
dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")
# Create tensors of different sizes to create hang
s = 100 * 1024 * 1024 * (world_size - rank)
with dist._coalescing_manager(device=device):
    dist.all_reduce(torch.ones(s, device=device))
    dist.broadcast(torch.ones(s, device=device), src=0)
torch.cuda.synchronize()
print(f"{dist.get_rank()} done")
```
Watchdog fires:
```
$ torchrun --nproc-per-node 2 example.py
...
[rank1]:[E ProcessGroupNCCL.cpp:545] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10000 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:545] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10567 milliseconds before timing out.
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117682
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `deeplearning/aot_inductor/test/test_custom_ops.py`.
Test Plan: buck test mode/dev-nosan fbcode//deeplearning/aot_inductor/test:test_custom_ops -- test_export_extern_fallback_nodes_dynamic_shape
Differential Revision: D52790332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117573
Approved by: https://github.com/angelayi
Summary:
AOTInductor currently infers the cuda device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking AOTInductor run.
This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".
todo:
- Are the changes in interface.h BC-breaking, as they change the function signatures in the .so file? We might just need to introduce a new "Create" function.
Test Plan: CI
Differential Revision:
D52747132
Privacy Context Container: 368960445142440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910
Summary:
We saw the following failure when compiling custom triton kernels:
```
RuntimeError: Argument 'getitem_22' of Node 'triton_kernel_wrapper_functional_proxy_3' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
```
The root cause is that when doing the replacement, a replacement can itself be replaced by another replacement. The fix keeps following the replacement chain until the node is no longer replaced (see the small sketch below).
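A small illustrative sketch of the fix: keep chasing a node through the replacement map until it no longer points at another replacement.
```python
def resolve_replacement(node, replacements):
    # Follow the chain: a replacement may itself have been replaced.
    while node in replacements:
        node = replacements[node]
    return node
```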
Test Plan:
Added a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117612
Approved by: https://github.com/aakhundov
Fixed #116848
Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264
Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
    # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used but can be provided by the user. The code was working for cases where the user-provided scale value can be recomputed from the `input / output` sizes, e.g. scale=2.0. However, this would fail if the input scale is a float value, e.g. 2.3; in this case the recomputed scale is a bit different (e.g. 2.292682926829268, depending on input and output size) and can lead to an inconsistent output.
This problem was "fixed" to the following in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
    # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
However, this leads to a wrong scale value, as the user-provided scale should be inverted (1 / scale); a corrected sketch follows below.
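A self-contained sketch of the corrected handling described above (a hypothetical helper for illustration, not the actual decomposition code):
```python
from typing import Optional, Sequence, Tuple

def compute_scales(
    i_sizes: Sequence[int],
    o_sizes: Sequence[int],
    scales_x: Tuple[Optional[float], ...],
):
    # Start from the size-derived input/output ratios...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    # ...and only override them when the user supplied an explicit scale,
    # inverting it because the user passes output/input scale factors.
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = 1.0 / scale
    return scales
```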
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree flattened into supporting types (tensors, symints, ...). So if user passes in some datastructure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".
The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.
Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
Summary: Need to pass this along
Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```
Reviewed By: larryliu0820
Differential Revision: D52812369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check of global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait too long to check the timeout signal from the store; otherwise, we will miss the signal and the debugging info won't get dumped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
Reland #117425
Prior to this PR, xfail tests didn't guarantee (1) the error message/reason (which could be outdated), and (2) execution of the test (xfail_if_model_type_is_not_exportedprogram). Therefore, tests labeled xfail are less robust, as we can't be sure whether they still fail for the same reason, or whether they even still fail. This PR fixes the issue with try/except and error message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117592
Approved by: https://github.com/BowenBao
This PR introduces a key path API to pytrees, drawing direct inspiration from JAX's [key path API](https://jax.readthedocs.io/en/latest/jax-101/05.1-pytrees.html#key-paths).
I added the 3 APIs described there, and a registry of `flatten_with_keys` fns for each node type, which is a version of `flatten` that also returns `KeyEntry`s describing how to access values from the original pytree.
Current use cases for this API:
- Folks would like to do argument traversal over input pytrees to do verification and compatibility enforcement. Keypaths are useful for this—https://fburl.com/code/06p7zrvr is a handrolled pass doing basically the same thing but probably more fragilely.
- In export non-strict mode, we need to figure out a way to track sources for pytree inputs. In strict mode, dynamo handles this for us, but we'd like a decoupled component to handle this when we're not using dynamo.
I'm sure there are places it would be useful.
Some design notes:
- I only implemented the API for the Python pytree impl. optree has some differences in how their keypath APIs are designed (see https://github.com/pytorch/pytorch/issues/113378 for discussion). I have some issues with the proposed typed_path solution in that discussion and prefer JAX's API, but we can hash that out separately.
- The way folks register a `flatten_with_keys` fn is through a new kwarg to `register_pytree_node`. This follows how we do serialization fns, although the list of additional arguments is getting unwieldy.
- My impl handles pytrees with an undefined `flatten_with_keys` fn differently from JAX: I raise an error, while JAX creates a fallback key entry.
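A rough usage sketch of the key path API (the helper names follow the JAX-style API described above and are assumptions; they may differ slightly in the final implementation):
```python
import torch
import torch.utils._pytree as pytree

tree = {"weights": [torch.ones(2), torch.zeros(3)], "step": 7}

flat_with_paths, spec = pytree.tree_flatten_with_path(tree)
for key_path, leaf in flat_with_paths:
    # keystr renders a key path into a readable accessor string,
    # handy for error messages and input-source tracking.
    print(pytree.keystr(key_path), type(leaf).__name__)
```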
Differential Revision: [D52547850](https://our.internmc.facebook.com/intern/diff/D52547850/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116786
Approved by: https://github.com/voznesenskym
As title. This PR enables dynamic shapes for running llama with ORT. Both forward and backward are captured as a single graph with this PR.
Summary of changes:
- Test llama attention, llama decoder, llama model to ensure (1) no graph breaks (2) models exported with dynamic shapes with onnxrt dynamo backend
- Reshape SymInt to tensor with shape (1,) to align with the cast done for int in fx_onnx_interpreter.py
- Create an util function to map Python types (e.g., float) to ONNX tensor element type (e.g., onnx.TensorProto.FLOAT).
- Return `hint` for torch.Sym* in type promotion pass.
- Remove _replace_to_copy_with_to since the exporter supports aten::_to_copy now.
- Modify _get_onnx_devices to return CPU device for torch.Sym*.
- Introduce _adjust_scalar_from_fx_to_onnx (e.g., change 0 to tensor(0)) and _adjust_scalar_from_onnx_to_fx (e.g., change tensor(0) to 0) for adjusting scalars when passing values to and receive values from ORT.
- Now, ValueInfoProto of graph inputs (i.e., input_value_infos) are stored and used as `ORT-expected type` when calling `_adjust_scalar_from_fx_to_onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117009
Approved by: https://github.com/titaiwangms
- We silently run skipped tests and then raise a skip message with the
error message (if any)
- Instead of raising expectedFailure, we raise a skip message with the
error message (if any)
We log the skip messages in CI, so this will let us read the logs and do
some basic triaging of the failure messages.
Test Plan:
- existing tests. I hope that there are no tests that cause each other
to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117401
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391, #117400
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll back that change.
In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>.
On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because the `CHUNK_SIZE` for each thread is already being computed per some heuristic. If `grain_size` were `0`, the work would be distributed equitably among the OpenMP threads (which, BTW, stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), thus yielding a speedup similar to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.
### Benchmarks
Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.
Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`
#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03% |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|
#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K.
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well if tiling is done as follows:
```objc
NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
for (int64_t i = 0; i < M; i += tile_size) {
  const auto i_end = std::min(i + tile_size, M);
  NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
  for (int64_t j = 0; j < K; j += tile_size) {
    const auto j_end = std::min(j + tile_size, K);
    MPSGraphTensor* tile = nil;
    for (int64_t k = 0; k < N; k += tile_size) {
      const auto k_end = std::min(k + tile_size, N);
      auto selfChunk = [graph sliceTensor:selfTensor
                                   starts:@[ @(i), @(k) ]
                                     ends:@[ @(i_end), @(k_end) ]
                                  strides:@[ @(1), @(1) ]
                                     name:nil];
      auto otherChunk = [graph sliceTensor:otherTensor
                                    starts:@[ @(k), @(j) ]
                                      ends:@[ @(k_end), @(j_end) ]
                                   strides:@[ @(1), @(1) ]
                                      name:nil];
      auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];
      tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
    }
    [row_chunks addObject:tile];
  }
  auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
  [rows addObject:row];
}
return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```
One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable.
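For illustration, a minimal way to exercise that fallback might look like the sketch below (shapes are arbitrary; the only assumptions, per the description above, are that one matrix dimension/stride exceeds 32768 elements and that the environment variable is set before the op runs):
```python
# Hypothetical repro sketch: opt into the Metal shader path, then run a matmul
# whose inner dimension exceeds the 32768-element threshold described above.
import os
os.environ["PYTORCH_MPS_PREFER_METAL"] = "1"

import torch
a = torch.rand(64, 40000, device="mps")  # stride over dim 0 is 40000 > 32768
b = torch.rand(40000, 64, device="mps")
print((a @ b).shape)
```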
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
Currently, DCP requires the `model.state_dict()` to be materialized before passing it to DCP to load, since DCP uses the pre-allocated storage from the initialized model state_dict. Therefore, even for fine-tuning and distributed inference, users would need to explicitly materialize the model on GPU before `DCP.load_state_dict()`.
Today's flow:
```
with torch.device("meta"):
model2 = parallelize_module(
MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
)
model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
state_dict=state_dict_to_load,
storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load)
```
This PR adds support for meta tensor loading. In DCP's planner, when encountering tensors/DTensor on meta device, we initialize tensor/DTensor on the current device on the fly and replace the tensor/DTensor on meta device in the state_dict. After the change, users no longer needs to manually call `model.to_empty()` when loading existing checkpoints for fine-tuning and distributed inference.
Updated user flow:
```
with torch.device("meta"):
model2 = parallelize_module(
MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
)
# no longer need to call model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
state_dict=state_dict_to_load,
storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load, assign=True)
```
Note that for distributed training, it's still the users' responsibility to reset the parameters (`model.reset_parameters()`), as a checkpoint might not exist.
Note that we need to loop through the state_dict to replace meta tensors/DTensors instead of calling `model.to_empty()`, since `DCP.load()` only takes in the state_dict but not the model.
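To make the idea concrete, a simplified sketch of that replacement loop for plain tensors might look like the following (`materialize_meta_tensors` is a hypothetical helper for illustration, not the actual planner code; DTensors need the analogous device-mesh-aware construction):
```python
import torch

def materialize_meta_tensors(state_dict, device="cuda"):
    # Replace each meta tensor with an uninitialized tensor of the same
    # shape/dtype on a real device, so DCP.load() has storage to read into.
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor) and value.is_meta:
            state_dict[key] = torch.empty_like(value, device=device)
    return state_dict
```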
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113319
Approved by: https://github.com/fegin, https://github.com/LucasLLC
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
* Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E from user code:
E File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E image = interpolate(
E
E Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E You can suppress this exception and fall back to eager by setting:
E import torch._dynamo
E torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
We have a denylist for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
Prior to this PR, xfail tests didn't guarantee (1) the error message/reason (**could be outdated**), or (2) execution of the test (`xfail_if_model_type_is_not_exportedprogram`). The xfail-labeled tests were therefore less robust, as we couldn't be sure whether a test still fails for the same reason, or whether it is even still failing. This PR fixes the issue with a try/except plus error-message matching to keep the xfail status and reason accurate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117425
Approved by: https://github.com/thiagocrepaldi
- passrate.py: compute the pass rate
- update_failures.py: update `dynamo_test_failures.py`
Both of these scripts require you to download the test results from CI
locally. Maybe we can automate this more in the future. Checking these
in for now, with no tests :P.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117400
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391
Refactor the common part between `mm_out_mps` and `addmm_out_mps` into a `do_mm` static function.
Change the input placeholder initialization logic so that `addmm` can handle matrix multiplication with an empty dimension.
Add tests for `mm`+`addmm` with empty tensors to OpInfo, but skip `addmm` with empty matrices in ONNX tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117223
Approved by: https://github.com/albanD
For training graphs (when inputs require grad), previously, we would speculate the forward and backward graphs to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.
This approach does not work for more generalized graphs, like graphs that include user defined triton kernels, because autograd is not able to do the higher order function conversion.
This PR speculates the forward and backward functions and emits them in a HOF that later gets used via a templating mechanism.
While working on this PR, I exposed some bugs in the current tracing due to trampoline functions losing the source information, resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant, or returns `False` if it cannot be simplified.
This is useful in cases that can be optimized if the condition is met;
otherwise it doesn't affect correctness, so we can avoid adding guards.
I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.
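As a usage sketch (the import path is an assumption about where the helper lives, and `maybe_skip_copy` is a hypothetical caller):
```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

def maybe_skip_copy(src, dst):
    # Take the fast path only when equality can be proven without installing a
    # new guard; if it cannot be proven, statically_known_true returns False
    # and we fall back to the always-correct slow path.
    if statically_known_true(src.shape[0] == dst.shape[0]):
        return src
    return dst.copy_(src)
```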
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
Fixes#117110
When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.
In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.
This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.
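A small sketch of the clamping semantics relied on here (`normalize_slice` is a hypothetical helper for illustration, not the code added in this PR):
```python
def normalize_slice(start, end, dim_size):
    # Negative indices wrap around, both endpoints are clamped to [0, dim_size],
    # and end < start yields an empty slice (length 0).
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    return start, max(end, start)

assert normalize_slice(2, 0, 4) == (2, 2)    # the isoneutral_mixing case: empty slice
assert normalize_slice(-1, 100, 4) == (3, 4) # out-of-bounds end is clamped
```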
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
## Motivation
The current code `value in [torch.backends.cudnn, torch.ops]` requires `value` to have an implementation of `__eq__`. If the value is a custom object and does not implement `__eq__`, dynamo will throw an error. For example, for ConvolutionOpContext, the custom 'torch._C.ScriptClass' object registered in IPEX, dynamo throws the following error:
**torch._dynamo.exc.InternalTorchDynamoError: '__eq__' is not implemented for __torch__.torch.classes.ipex_prepack.ConvolutionOpContext**
I think this is a common issue. To avoid it, this PR replaces the current code `value in [torch.backends.cudnn, torch.ops]` with `isinstance(value, (torch.backends.cudnn.CudnnModule, torch._ops._Ops))`
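A minimal illustration of why the membership test is fragile while the `isinstance` check is not (`NoEq` stands in for a custom `torch._C.ScriptClass` object such as ConvolutionOpContext):
```python
class NoEq:
    def __eq__(self, other):
        raise NotImplementedError("'__eq__' is not implemented")

value = NoEq()
try:
    value in [object(), object()]   # list membership falls back to __eq__ and raises
except NotImplementedError:
    print("membership test needs __eq__")
print(isinstance(value, (int, NoEq)))  # pure type check, no __eq__ involved -> True
```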
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116856
Approved by: https://github.com/jansel
As Conda binaries are still built on MacOS 12, which renders MPS unusable after https://github.com/pytorch/pytorch/pull/116942
Test plan:
```
% xcrun -sdk macosx metal --std=macos-metal2.3 -Wall -o Index Index.metal
% xcrun -sdk macosx metal --std=macos-metal2.2 -Wall -o Index Index.metal
Index.metal:167:1: error: type 'const constant ulong3 *' is not valid for attribute 'buffer'
REGISTER_INDEX_OP_ALL_DTYPES(select);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index.metal:159:5: note: expanded from macro 'REGISTER_INDEX_OP_ALL_DTYPES'
REGISTER_INDEX_OP(8bit, idx64, char, INDEX_OP_TYPE, ulong3); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
```
Fixes https://github.com/pytorch/pytorch/issues/117465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117472
Approved by: https://github.com/xuzhao9
When working with internal flows, it can sometimes be ambiguous what
version of the code they are working with. In this case, having the
function name available in the stack trace can help identify what you
are looking at.
Example now looks like:
```
[DEBUG] # File: /data/users/ezyang/a/pytorch/a.py:5 in f, code: return x + x
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117459
Approved by: https://github.com/Skylion007
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.
This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).
Test Plan:
```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
* This is an old builtin function equivalent to the bool constructor. It is easy enough to add support for.
* I also realized the tests were in the wrong class (the one reserved for testing default args), so I moved them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117463
Approved by: https://github.com/jansel
We can use `__index__` to do this conversion because that will trigger a
guard on data dependent SymInt if the tensor is a fake tensor, but if
we fetch item directly and put it in the Scalar, we may still be able to
make it work out.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117454
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451, #117452
This fastpath is unnecessary because in the logic below we
do the same thing:
```
auto& var = THPVariable_Unpack(obj);
if (var.numel() != 1 ||
    !at::isIntegralType(
        var.dtype().toScalarType(), /*include_bool*/ true)) {
  throw_intlist_exception(this, i, obj, idx);
}
auto scalar = var.item();
TORCH_CHECK(scalar.isIntegral(/*include bool*/ false));
res.push_back(scalar.toSymInt());
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117452
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451
The current timeout check frequency relies on the monitoring thread's timeout, which can be too long (even if we set it to 2 mins), so let's use a separate timeout variable that users can configure. And we only let the default PG check TCPStore, so an even more frequent check should be fine. (Our stress test is performed every half second.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
Summary:
A follow up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` into the core_aten_decomposition
table. This PR does that and also add a unit test.
Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
Summary:
## Context
When running large models with a lot of operators, the default descriptor pool allocated by the Vulkan compute API may run out of descriptor sets. This changeset introduces the `VULKAN_DESCRIPTOR_POOL_SIZE` build variable (which will default to `1024u`) which can allow for a larger descriptor pool to be allocated if necessary.
## Notes for Reviewers
This is a simple stopgap solution until we have bandwidth to implement the more general solution, which would be to modify the `DescriptorPool` class defined in `api/Descriptor.[h,cpp]` to automatically allocate a new descriptor pool when memory runs out. However, I would consider this change to be low priority since with a delegate/graph mode of execution, the descriptor pool can often be allocated to exactly fit a model's requirements.
Test Plan:
There should be no functional changes under default build settings. Run `vulkan_api_test` to make sure everything works as before; CI should test for that as well.
```
# On devserver
LD_LIBRARY_PATH=/home/ssjia/Github/swiftshader_prebuilt/swiftshader/build/bin/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```
Reviewed By: yipjustin, jorgep31415
Differential Revision: D52742140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117398
Approved by: https://github.com/yipjustin
Summary:
Fix the numerical issue with addcmul.
Found that torch.addcmul generates different values from torch.add+torch.mul with a 32-bit check. Mini repro: N4823658
Change addcmul to torch.add+torch.mul
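As a quick sanity check of the rewrite (an illustrative sketch; the tolerance is arbitrary):
```python
import torch

t, t1, t2 = torch.randn(8), torch.randn(8), torch.randn(8)
v = 0.5
fused = torch.addcmul(t, t1, t2, value=v)    # t + v * t1 * t2 in one fused op
split = torch.add(t, torch.mul(t1, t2) * v)  # the decomposed form used here
# The two agree up to floating-point rounding; the tiny last-bit differences
# are exactly the numerical mismatch described above.
print(torch.allclose(fused, split, atol=1e-6))
```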
Test Plan:
buck test
before change
```
the diff index is: 0
the diff index is: 1
the diff index is: 6
```
after change numeric on par
Differential Revision: D52745671
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117404
Approved by: https://github.com/mengluy0125
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending proper stream support design
This PR does a few things:
1) fixes registration for devices with indices. We were only supporting "cuda"; we now support "cuda:k" interfaces where k is the GPU index
2) Changes the stream objects in dynamo to take devices as device types, instead of strings, and updates the string based device APIs to gracefully take device types.
3) Introduces a reconstruct-by-global (using existing cleanup hook structures) to streams as a placeholder impl for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
Summary:
* in some fx partially-specialized codegen via `concrete_args` on boolean arguments, we extend to further use the graphmodule on strongly typed runtimes like torchscript.
* this diff fixes the type annotation for boolean only and preserves the argument mapping for leaf pytree nodes.
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:fx -- --exact 'caffe2/test:fx - test_partial_trace (test_fx.TestFX)'
Differential Revision: D52667883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117201
Approved by: https://github.com/houseroad
Summary:
These dtypes are added since we see more demand for these sub byte dtypes, especially with
the popularity of LLMs (https://pytorch.org/blog/accelerating-generative-ai-2/#step-4-reducing-the-size-of-the-weights-even-more-with-int4-quantization-and-gptq-2021-toks)
Note these are just placeholders, the operator support for these dtypes will be implemented with tensor subclass.
e.g. torch.empty(..., dtype=torch.uint1) will return a tensor subclass of uint1 that supports different operations like bitwise ops, add, mul, etc. (to be added later)
Also note that these are not quantized data types; we'll implement quantization logic with tensor subclasses backed by these dtypes as well.
e.g. `Int4GroupedQuantization(torch.Tensor)` will be implemented with torch.uint4 Tensors (see https://github.com/pytorch-labs/ao/pull/13 as an example)
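For reference, a tiny sketch of what the placeholders expose today (behavior is expected to evolve as described above):
```python
import torch

# The sub-byte placeholder dtypes introduced here; no ops are implemented for
# them yet, but they are addressable as regular torch dtypes.
print(torch.uint1, torch.uint2, torch.uint3, torch.uint4,
      torch.uint5, torch.uint6, torch.uint7)
```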
Test Plan:
CIs
python test/test_quantization.py -k test_uint1_7_dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117208
Approved by: https://github.com/ezyang
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.
duration_ms will be an optional field, since it only works when
timing is enabled. Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement. Duration is also
not available for collectives not in a completed state.
Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.
We don't ever want dump() api to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible during dump() call, some of the most recent
collectives durations won't have been computed yet at time of dump. We
make this tradeoff to ensure that dump() itself will never hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
- Removes an outdated assert that prevents perf tests from running DDP, we now have single node --multiprocess and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp`
- Append rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
The problem is that the dynamo_test_failures logic recognizes tests by
their TestClass.test_name. Unfortunately we have duplicate
TestClass.test_name in test_legacy_vmap and test_vmap. This PR
unduplicates them.
Something more robust would have been to include the test file name in
the dynamo_test_failures logic, but... it's a bit too late for that. We
can fix it if it becomes more of a problem in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117320
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117318
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.
One possible reason for watchdog stuckness is GIL contention, which could be ruled out or observed by making an attempt to acquire the GIL at exit time.
If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.
Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (Looks like the destructor for the std::function was being run later than the libtorchpython lib was unloaded, leading to a crash).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 arch. The error is reported as below:
```
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
Added clarification for the example provided for the pos_weight parameter in the BCEWithLogitsLoss class, particularly in the multi-label binary classification context. This enhancement addresses potential misunderstandings about applying 'binary' classification, which typically implies two classes, to scenarios involving multiple classes.
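For context, a short multi-label sketch of the parameter being documented (shapes and weights here are arbitrary):
```python
import torch
import torch.nn as nn

num_labels = 3                                # 3 independent binary labels per sample
pos_weight = torch.tensor([1.0, 2.0, 0.5])    # one weight per label's positive term
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(4, num_labels)           # batch of 4
targets = torch.randint(0, 2, (4, num_labels)).float()
loss = criterion(logits, targets)
```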
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117046
Approved by: https://github.com/mikaylagawarecki
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
**kwargs). This is true for all the OpOverloads but not HOPs. I
rewrote the code to not rely on this.
Test Plan:
- existing tests
- new test for auto_functionalized HOP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
Support this fallback by converting the jagged layout NT to a strided layout NT, and then convert the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
**Summary**
When a tensor is used as the activation of conv and as the extra input of the binary add node, we shouldn't do conv binary inplace fusion.
```
   a
  / \
conv |
  \  |
   add
```
**TestPlan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117331
Approved by: https://github.com/jgong5
ghstack dependencies: #117330
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
Intel GPU device is a wrapper of a sycl device on which kernels can be executed. In our design, we will maintain a sycl device pool containing all the GPU devices of the current machine, and manage the status of the device pool in PyTorch. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Summary:
Fixes https://github.com/pytorch/pytorch/issues/117033
Sometimes the solution returned by `sympy.solvers.inequalities.reduce_inequalities` can contain sub-expressions of the form `CRootOf(...)`, denoting the complex root of some equation in `x`, where `x` is an arbitrary symbol. We will now gracefully fail when this happens, like we already do when the solver itself fails.
Test Plan: added a test
Differential Revision: D52715578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117310
Approved by: https://github.com/ezyang
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.
The failing example was calling rrelu() which DOES have default parameters and
it passes those defaulted values to C++. Under the covers the C code was calling
the python version of rrelu_with_noise().
Although the C++ code was passing all the values to the Python version of
rrelu_with_noise(), the PyTorch C++ -> Python dispatch code looks at the schema
and strips any parameters which match the schema's listed defaults, so if the
schema shows defaults that aren't in the code it will be a problem.
Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.
Fixes#115811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
During ONNXProgram.save, the implicit/explicit state_dict passed in must
be loaded in memory in order to read each initializer and create an
external tensor proto with them
This PR ensures torch.load uses memory-map to support large models that
cannot fit in memory
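Concretely, the load is expected to look roughly like the sketch below (the file name is illustrative; `mmap=True` assumes the checkpoint was saved with the default zipfile serialization):
```python
import torch

# Memory-map the checkpoint instead of reading it fully into RAM.
state_dict = torch.load("large_model_checkpoint.pt", mmap=True, map_location="cpu")
```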
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117295
Approved by: https://github.com/BowenBao
ghstack dependencies: #117294
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset removes references to `c10::MemoryFormat` in `api/Tensor.[h,cpp]`; when constructing a `vTensor`, the `api::StorageType` (i.e. whether the tensor will be backed by buffer or texture storage) and `api::GPUMemoryLayout` (i.e. which dimension will be the fastest moving dimension) must be specified directly.
Differential Revision: [D52662234](https://our.internmc.facebook.com/intern/diff/D52662234/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117183
Approved by: https://github.com/liuk22, https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180, #117181
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset introduces `api::ScalarType` in `api/Types.h`, which is intended to function the same as `c10::ScalarType`; thus `api/Types.h` is the primary file of interest. The rest of the changes are straightforward replacements of `c10::ScalarType` with `api::ScalarType`.
Differential Revision: [D52662237](https://our.internmc.facebook.com/intern/diff/D52662237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117181
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset replaces all instances of `c10::ArrayRef<T>` with `std::vector<T>&` and all instances of`c10::IntArrayRef` with `std::vector<int64_t>&`. There are a lot of changes in this changeset but that is simply due to the large number of callsites. All the changes are straightforward replacements.
Differential Revision: [D52662235](https://our.internmc.facebook.com/intern/diff/D52662235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117179
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset introduces the `api::Error` class in `api/Exception.h`, which is a more barebones copy of the `c10::Error` class from [`c10/util/Exception.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Exception.h). The macros `VK_CHECK_COND` (equivalent to `TORCH_CHECK(cond, msg)`) and `VK_THROW` (equivalent to `TORCH_CHECK(false, msg)`) are introduced as well to replace calls to `TORCH_CHECK()` and similar macros.
Although this is a large diff, the most meaningful changes are in the added files `api/Exception.[h,cpp]` and `api/StringUtil.[h,cpp]` (which is mostly adapted from [`c10/util/StringUtil.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/StringUtil.h)) which implements `api::Error` and the new macros. The rest of the diff is replacing calls to `TORCH_CHECK()` and similar macros with `VK_CHECK_COND()` and `VK_THROW()`.
Differential Revision: [D52662233](https://our.internmc.facebook.com/intern/diff/D52662233/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117178
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
The majority of the changes in this changeset are:
* Replacing instances of `ska::flat_hash_map` with `std::unordered_map`
* `ska::flat_hash_map` is an optimized hash map, but the optimizations shouldn't be too impactful so `std::unordered_map` should suffice. Performance regression testing will be done at the final change in this stack to verify this.
* Replacing `c10::get_hash` with `std::hash` where only one variable is getting hashed or the `utils::hash_combine()` function added to `api/Utils.h` (which was copied from `c10/util/hash.h`)
Differential Revision: [D52662231](https://our.internmc.facebook.com/intern/diff/D52662231/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117177
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176
## Context
This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.
## Notes for Reviewers
This changeset deprecates various easy-to-replace symbols from the `c10` library with either C++ STL equivalents or by using copying those `c10` symbols as native equivalents. The symbols that were impacted are:
* `c10::irange`
* removed and replaced with standard for loops
* `C10_LIKELY` and `C10_UNLIKELY`
* These macros allow for some branch re-ordering compiler optimizations when building with GCC. They aren't strictly necessary and their impact is likely minimal so these have simply been removed.
* `c10::SmallVector<T, N>`
* My understanding is that `c10::SmallVector<T, N>` is essentially a wrapper around `std::vector<T>` that is optimized for array sizes up to `N`. I don't believe that this optimization is worth creating a native equivalent, so I replaced instances of this symbol with `std::vector<T>`
* `c10::multiply_integers`
* This function is simply a convenient wrapper around `std::accumulate`, so I copied it as a native equivalent in `api/Utils.h`
This changeset comprises entirely of the replacements described above.
Differential Revision: [D52662232](https://our.internmc.facebook.com/intern/diff/D52662232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117176
Approved by: https://github.com/yipjustin
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats
e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```
e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
…reference) (#109065)
Summary:
Modify the way we update gc_count in CUDACachingAllocator to make it faster.
Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes to use the correct constructor for the search key (to avoid the nullptr dereference). Also, added a nullptr check (returning 0 if null) in the gc_count functions.
Differential Revision: D49068760
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
Many of our pattern matching replacements are specified as a `search_fn` and a `replacement_fn`. The `search_fn`s are traced out once with static shapes, converted to a pattern, and then matched on every graph compiled with inductor.
The static shape patterns would not match with graphs that are traced out with dynamic shapes because SymInts would be added to the graph as `sym_size` fx nodes which added additional uses and prevented matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, as is the calling convention in aot autograd.
This PR adjusts our matching of the `search_fn` by adding SymInts to the arguments we trace out the search_fn with so that their symint accesses are deduped. Later, if we have a match, we will trace out the replacement graph with the correct Tensors and corresponding symbolic shapes that will get added to the graph.
Note: the replacement patterns will insert sym_size uses which could potentially be removed, but I'll leave that for follow up.
Fix for https://github.com/pytorch/pytorch/issues/111190.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
TorchDynamo will guard grad_mode and the local dispatch key set.
3a429423fc/torch/csrc/dynamo/guards.cpp (L13-L16)
While using ThroughputBenchmark, those tls state will not be init as same as the main thread status.
3a429423fc/torch/csrc/utils/throughput_benchmark-inl.h (L64-L94)
Run the following script:
```
import torch
linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    compiled(x)
    compiled(x)
from torch._dynamo import config
config.error_on_recompile = True
from torch.utils import ThroughputBenchmark
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    bench = ThroughputBenchmark(compiled)
    bench.add_input(x)
    stats = bench.benchmark(
        num_calling_threads=10,
        num_warmup_iters=100,
        num_iters=100,
    )
print(stats)
```
will lead to 2 re-compile reasons:
```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```
This will trigger a re-compile in torchdynamo. But since `ThroughputBenchmark` is used for sharing weights within threads, the model should not be changed anymore while running the benchmark. So this PR inits the TLS state the same as the main thread's. Then we can use `ThroughputBenchmark` to run torchdynamo-optimized models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113246
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes#105338
This PR changes the op consistency tests from manually adding ops to the testing list to automatically testing all ops in the registry. It also spots more complex dtype bugs in the converter.
Overall, this PR provides:
(1) Whole test coverage on ONNX registry
(2) More completed complex supports
(3) Only test the same dtypes as torchlib
(4) Auto xfail unsupported nodes
Follow-up issue: https://github.com/pytorch/pytorch/issues/117118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116319
Approved by: https://github.com/justinchuby
We observed that `with torch.no_grad()` in the runtime_wrapper introduced a ~10% (0.06ms -> 0.066ms) inference performance regression on lennard_jones on CPU.
For inference tasks in the benchmark, grad has already been disabled, but in the current runtime_wrapper, no_grad is set again and its time is counted in the running time.
Therefore, we add an `is_grad_enabled` check in the runtime_wrapper before entering torch.no_grad; if grad has already been disabled, there is no need to set no_grad again.
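A minimal sketch of that guard (not the actual runtime_wrapper code; the `nullcontext` fallback is just one way to express it):
```python
import contextlib
import torch

def run_wrapped(fn, *args):
    # Only pay for entering no_grad when grad is currently enabled.
    ctx = torch.no_grad() if torch.is_grad_enabled() else contextlib.nullcontext()
    with ctx:
        return fn(*args)
```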
Before this pr:
1.043x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.043427**,**0.068366**,4.756151,0.941846,45.056819,47.838822,9,1,0,0
After this pr:
1.146x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.146190**,**0.061844**,4.468380,0.936456,44.427264,47.441920,9,1,0,0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117089
Approved by: https://github.com/jgong5, https://github.com/bdhirsh
Key vars are strings used as dict keys (e.g. `duration_s` held the string "duration_ms").
The `_s` suffix confused me with time (seconds), since `duration_s` was a key string while
`duration_ms` is another variable holding a time value.
Now `duration_key` is "duration_ms".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
Summary:
Describes the arguments in more detail. Not in user facing docs for now, but a step towards getting there eventually.
Test Plan: CI
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117190
Approved by: https://github.com/drisspg
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. Adding this
decomposition so that it's being decomposed the same way as
`_scaled_dot_product_attention_math`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
Using `@skipifTorchDynamo` is wrong, the correct usage is
`@skipIfTorchDynamo()` or `@skipIfTorchDynamo("msg")`. This would cause
tests to stop existing.
Added an assertion for this and fixed the incorrect callsites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117114
Approved by: https://github.com/voznesenskym
In c10d PG initialization, we wrap TCPStore with multiple layers of PrefixStore which adds layers of prefix.
One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because there is no store passed, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)
We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)
to the prefix. Then when we call store_->set("timeout_dump"). The actual key used for writing into TCPStore is "default_pg/0//cuda//timeout_dump".
For sub-PG, things get even interesting, we put the store wrapped with default pg name to a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)
And when creating each subPG, it is append its PG name right after the cached store. The example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)
The reason why the number in the name is bumped up so high is that for each subPG creation, all ranks have to call the API together, and the global variable used for the PG name is bumped up monotonically:
bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)
Similar things happen for using hashing for PG names.
This has a potential issue: each sub-PG has its own instance of ProcessGroupNCCL, and if we want to set something global to notify all sub-PGs (and all ranks), the added prefix causes bugs. For example, if on sub-PG 1 we set a value in TCPStore with the key 'default_pg/0//1//cuda//timeout_dump', while the default PG instances check the TCPStore using the key 'default_pg/0//cuda//timeout_dump', the default PG instances will never get the notification signals. So in this PR, we add a new API in PrefixStore that gets the innermost non-PrefixStore for set and check. The next PR will make the changes in the NCCL watchdog.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
From https://pytest.org/en/7.4.x/how-to/assert.html#advanced-assertion-introspection
pytest only rewrites test modules directly discovered by its test collection process, so asserts in supporting modules which are not themselves test modules will not be rewritten.
In CI we usually call the test file (`python test_ops.py`), which then calls run_test which then calls pytest.main, so the test module is already imported as `__main__`, so pytest does not import the test module itself and relies on the already imported module. (#95844)
However, calling `pytest test_ops.py` will rely on pytest to import the module, resulting in asserts being rewritten, so I add --assert=plain by default into the opts so we don't have to worry about this anymore. Another way to make pytest stop assertion rewriting in a file is to include `PYTEST_DONT_REWRITE` somewhere in the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117060
Approved by: https://github.com/zou3519
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after the `dist.all_reduce`
**Bug addressed in this PR**: for CPU-offloaded param.grad, when rank 0 has inf, ranks 1...k would not have found_inf=1. This is because `found_inf` was copied before `future.wait()` on the async `dist.all_reduce`
Repro the bug using the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`
```
File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
self.assertEqual(
File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!
Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```
verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`
```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
. [100%]
====================================================================== 1 passed, 19 deselected in 27.43s =========================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
This PR adds support for torch.autograd.Function subclasses in compiled autograd. We do this by:
- Creating a uid for all torch.autograd.Function via its metaclass. This uid is used in the compiled autograd key, which is a subset of the cache key to the compiled graph
- "Lifting" the backward/saved_tensors, having them as input arguments in the compiled graph
- Creating proxies to track the backward's inputs and outputs. Since the backward's outputs (grads) have to match the forward's inputs, we pass the node's `input_info` (forward's input sizes) to build the proxies tracking the backward's outputs.
- Use a `FakeContext` class as a replacement for the autograd node's context object (`BackwardCFunction`) during tracing; it only supports passing saved_tensors from the forward to the backward
- Index each backward, to support multiple torch.autograd.Functions in the same graph
- Special case for `CompiledFunctionBackward`: lifting CompiledFunction will fail 4 tests and requires some skipfiles changes that I'd rather do in a separate PR
Example graph: test_custom_fn_saved_multiple_tensors (eager fw + compiled autograd)
```python
class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)
        return torch.sin(x), torch.sin(y)

    @staticmethod
    def backward(ctx, gO_x, gO_y):
        (x, y) = ctx.saved_tensors
        return gO_x * torch.cos(x), gO_y * torch.cos(y)
```
The backwards is lifted via `getitem_5` and `call_backward`
```python
# Compiled autograd graph
===== Compiled autograd graph =====
<eval_with_key>.0 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]" = inputs[0]
        getitem_1: "f32[10]" = inputs[1]
        getitem_2: "f32[10]" = inputs[2]
        getitem_3: "f32[10]" = inputs[3]
        getitem_4: "f32[10]" = inputs[4]; inputs = None
        expand: "f32[10]" = torch.ops.aten.expand.default(getitem, [10]); getitem = None
        mul: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
        getitem_5 = hooks[0]; hooks = None
        call_backward = torch__dynamo_external_utils_call_backward(getitem_5, (getitem_3, getitem_4), mul_1, mul); getitem_5 = mul_1 = mul = None
        getitem_6: "f32[10]" = call_backward[0]
        getitem_7: "f32[10]" = call_backward[1]; call_backward = None
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7); getitem_4 = getitem_7 = None
        accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6); getitem_3 = getitem_6 = None
        return []
```
then is later inlined by dynamo
```python
# Dynamo graph
===== __compiled_fn_0 =====
<eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_0_ : torch.Tensor, L_inputs_1_ : torch.Tensor, L_inputs_2_ : torch.Tensor, L_inputs_3_ : torch.Tensor, L_inputs_4_ : torch.Tensor):
        getitem = L_inputs_0_
        getitem_1 = L_inputs_1_
        getitem_2 = L_inputs_2_
        x = L_inputs_3_
        y = L_inputs_4_
        # File: <eval_with_key>.0:10, code: expand = torch.ops.aten.expand.default(getitem, [10]); getitem = None
        expand = torch.ops.aten.expand.default(getitem, [10]); getitem = None
        # File: <eval_with_key>.0:11, code: mul = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
        mul = torch.ops.aten.mul.Tensor(expand, getitem_2); getitem_2 = None
        # File: <eval_with_key>.0:12, code: mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
        mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1); expand = getitem_1 = None
        # File: /data/users/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py:412, code: return gO_x * torch.cos(x), gO_y * torch.cos(y)
        cos = torch.cos(x)
        getitem_6 = mul_1 * cos; mul_1 = cos = None
        cos_1 = torch.cos(y)
        getitem_7 = mul * cos_1; mul = cos_1 = None
        # File: <eval_with_key>.0:17, code: accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7); getitem_4 = getitem_7 = None
        accumulate_grad__default = torch.ops.inductor.accumulate_grad_.default(y, getitem_7); y = getitem_7 = None
        # File: <eval_with_key>.0:18, code: accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6); getitem_3 = getitem_6 = None
        accumulate_grad__default_1 = torch.ops.inductor.accumulate_grad_.default(x, getitem_6); x = getitem_6 = None
        return ()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115573
Approved by: https://github.com/jansel
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in the port-metadata pass that required a single dequantize node after a quantize node.
**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Previously, kwargs were incorrectly dispatched by passing them as the true kwargs to the torch function call. To fix this, the kwargs of the original torch op need to be stored in a dictionary and passed as a single argument to the torch function implementation.
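For reference, the torch function protocol passes kwargs as a single dict argument, as in this small subclass sketch (unrelated to the specific code path fixed here):
```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"{func.__name__} kwargs={kwargs}")  # kwargs arrive as a dict, not splatted
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
torch.sum(x, dim=0, keepdim=True)                  # -> kwargs={'dim': 0, 'keepdim': True}
```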
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117083
Approved by: https://github.com/drisspg
Contents may be out of date. Our CircleCI workflows are gradually being migrated to Github actions.
Structure of CI
===============
setup job:
1. Does a git checkout
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
   We don't always do a Git checkout on all subjobs, but we usually
   still want to be able to call scripts one way or another in a subjob.
   Persisting files this way lets us have access to them without doing a
   checkout. This workspace is conventionally mounted on `~/workspace`
   (this is distinguished from `~/project`, which is the conventional
   working directory that CircleCI will default to starting your jobs
   in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
   we can determine in subjobs if we should actually run the jobs or
   not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of the following dimensions (a small illustrative sketch follows this list):
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 3.7m (the `m`/`mu` suffixes are CPython ABI flags; `mu` indicates a wide-unicode build. It usually doesn't matter, but you should know that it exists)
* macos: 3.7, 3.8
* windows: 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. The tl;dr is that gcc made a backwards-incompatible ABI change from gcc 4.8 to gcc 5 because it had to change how it implemented std::vector and std::string
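To make those dimensions concrete, here is a purely illustrative sketch (hypothetical names and a trimmed-down value set; the real matrix lives in the CI generator scripts) showing how quickly the cross product grows:
```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class BinaryConfiguration:
    channel: str      # "release" or "nightly"
    python: str       # e.g. "3.7"
    accelerator: str  # "cpu", "cuda9.0", "cuda10.0"
    os: str           # "linux", "macos", "windows"
    devtoolset: str   # gcc toolchain; only meaningful on linux

matrix = [
    BinaryConfiguration(c, p, a, o, "gcc5" if o == "linux" else "n/a")
    for c, p, a, o in product(
        ("nightly", "release"),
        ("3.7", "3.8"),
        ("cpu", "cuda10.0"),
        ("linux", "macos", "windows"),
    )
]
print(len(matrix))  # 24 combinations even for this trimmed example
```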
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to PyTorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
3. For each binary configuration, e.g. linux_conda_3.7_cpu, there is a
   1. smoke_linux_conda_3.7_cpu
      1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
      2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/                # Folder that holds all circleci related stuff
  - config.yml              # GENERATED file that actually controls all circleci behavior
  - verbatim-sources        # Used to generate job/workflow sections in ^
  - scripts/                # Code needed to prepare circleci environments for binary build scripts
- setup.py                  # Builds pytorch. This is wrapped in pytorch/builder
- cmake files               # used in normal building of pytorch

# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder

# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh

# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh

# Folders that govern how packages are built. See paragraphs below
- conda/
  - build_pytorch.sh        # Entrypoint. Delegates to proper conda build folder
  - switch_cuda_version.sh  # Switches the active CUDA installation in Docker
  - pytorch-nightly/        # Build-folder
- manywheel/
  - build_cpu.sh            # Entrypoint for cpu builds
  - build.sh                # Entrypoint for CUDA builds
  - build_common.sh         # Actual build script that ^^ call into
- wheel/
  - build_wheel.sh          # Entrypoint for wheel builds
- windows/
  - build_pytorch.bat       # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called in that env to build the thing.
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for Linux, MacOS and Windows
* The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there are still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines until circleci). So this script handled isolating itself by setting up and tearing down its own build env and working in its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## Windows Wheels (Windows pip and libtorch packages)
The entrypoint file `builder/windows/build_pytorch.bat` is complicated because
* This used to handle building for several different python versions at the same time. This is why there are loops everywhere
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages a bit and maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up (a minimal example of such a check is sketched after this list).
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
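As an illustration only (not the actual contents of run_tests.sh, smoke_test.sh, or check_binary.sh), a smoke check of this kind boils down to something like:
```python
# Minimal sketch of a binary smoke check: import the package, report the
# version and CUDA availability, and run one trivial tensor operation.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

x = torch.rand(2, 3)
assert (x + x).shape == (2, 3), "basic tensor arithmetic is broken"
```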
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It’s confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* pytorch/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda100
* Also used for cpu builds
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```sh
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I’d recommend
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but the tl;dr is that
1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` and happened to have previously installed it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
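To illustrate the PATH shadowing described above, here is a small, hypothetical Python sketch (the paths are assumptions, not real installs) showing how executable lookup walks PATH left to right:
```python
# Hypothetical illustration of PATH shadowing: executable lookup walks PATH
# left to right, so whichever directory appears first wins.
import os
import shutil

conda_root = os.path.expanduser("~/miniconda3")  # assumed install location
env_bin = os.path.join(conda_root, "envs", "binary", "bin")
root_bin = os.path.join(conda_root, "bin")

# After `conda activate binary`, PATH effectively starts with these two entries.
os.environ["PATH"] = os.pathsep.join([env_bin, root_bin, os.environ["PATH"]])

# shutil.which walks PATH in order; if `foo` is missing from env_bin but
# present in root_bin, the root install's copy is what gets picked up.
print(shutil.which("foo"))  # prints the first match on PATH, or None
```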
### Windows
TODO: fill in
PyTorch's migration from CircleCI to GitHub Actions has been completed. All continuous integration & deployment workflows are defined in the `.github/workflows` folder
```sh
if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
  echo ""
  echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
  echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"
  RC=1
fi

# Use jq to massage the JSON lint output into GitHub Actions workflow commands.
jq --raw-output \
  '"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \
  lint.json
```