Summary:
The current order of implicit sharing breaks common annotation patterns of SharedQuantizationSpec, so we changed the order here.
But it's still not going to work in all possible annotation cases, so quantizer implementors need to be careful.
In general, annotations should work as long as SharedQuantizationSpec only refers to nodes/edges that come before the node/edge currently being annotated.
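For reference, here is a sketch of the kind of annotation pattern that is expected to keep working (illustrative quantizer-side code, not taken from this PR; the helper name and arguments are made up):
```
from torch.ao.quantization.quantizer import (
    QuantizationAnnotation,
    SharedQuantizationSpec,
)

# Hypothetical helper: annotate an `add` node so that its second input and its
# output share quantization parameters with the first input edge. The shared
# specs only point at an edge that appears *before* the edge/node being
# annotated, which is the pattern the new sharing order supports.
def annotate_add(add_node, input_act0, input_act1, act_qspec):
    add_node.meta["quantization_annotation"] = QuantizationAnnotation(
        input_qspec_map={
            input_act0: act_qspec,
            input_act1: SharedQuantizationSpec((input_act0, add_node)),
        },
        output_qspec=SharedQuantizationSpec((input_act0, add_node)),
        _annotated=True,
    )
```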
Test Plan: CI; also verified that this fixes some internal tests
Differential Revision: D51605918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114704
Approved by: https://github.com/andrewor14
While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`.
I used a modified version of the operator benchmark script [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench, according to the existing classifications that @shunting314 used - grouped convs, small-channel convs, and convs with more in-channels than out-channels. Only grouped convolutions benchmarked as a slowdown in inference.
I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress.
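As a rough sketch of the scoring idea (the category names and speedup factors below are purely illustrative, not the values used by the actual heuristic):
```
# Hypothetical predicted channels-last speedup per conv category (illustrative only).
PREDICTED_SPEEDUP = {
    "grouped": 0.9,          # grouped convs benchmarked as a slowdown
    "small_channels": 1.2,
    "in_gt_out": 1.1,
    "other": 1.3,
}

def prefers_channels_last(convs):
    """convs: iterable of (flops, category) pairs collected from the graph."""
    baseline = sum(flops for flops, _ in convs)
    weighted = sum(flops * PREDICTED_SPEEDUP.get(cat, 1.0) for flops, cat in convs)
    # Flip the graph to channels last only if the FLOP-weighted prediction is a net win.
    return weighted > baseline

print(prefers_channels_last([(1e9, "other"), (2e8, "grouped")]))  # True
```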
Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%.
There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600
Approved by: https://github.com/jansel, https://github.com/shunting314
As in the title.
The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to taking input, beta, and alpha parameters):
- it implements a `SPLIT_N` kernel parameter that enables efficient kernel launches for wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and a 256x131072 strided input is reduced by about 16x (this corresponds to the 94% speed-up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights
The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 55/94 %
- with 32x32 blocks, the average/maximal speed up is 33/63 %
- with 64x64 blocks, the average/maximal speed up is 23/42 %
- with 128x128 blocks, the average/maximal speed up is 15/39 %
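For context, a minimal sketch of the kind of nn.linear call these numbers refer to (shapes taken from the 16x16-block case above; assumes CUDA and float16, and uses a dense weight converted to BSR purely for illustration):
```
import torch
import torch.nn.functional as F

weight = torch.randn(256, 256, dtype=torch.float16, device="cuda")
bias = torch.randn(256, dtype=torch.float16, device="cuda")
x = torch.randn(131072, 256, dtype=torch.float16, device="cuda")

# Convert the weight to BSR with 16x16 blocks; F.linear with a BSR weight
# dispatches to the Triton-based sparse kernels on CUDA.
bsr_weight = weight.to_sparse_bsr(blocksize=(16, 16))
y = F.linear(x, bsr_weight, bias)
```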
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
Quick recap of events:
(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around mutations of inputs that require grad when they show up in an inference-only graph (the specific case where this can happen is rare and nobody reported the issue, but it was fixed a few weeks later)
(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them
(3) That fix in turn caused a slight regression compared to (1), which is what this PR attempts to fix. In particular, it is safe to keep the mutations in the graph for code like the below:
```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```
This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
Summary:
1. We stop using excess memory in generate_opcheck_tests. This is safe because all the individual test utils already ensure that they do not modify the inputs.
2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open source). They were previously removed because they OOM'ed when run serially; (1) and (3) cut down the memory usage to ~20 GB peak.
3. I needed to skip some newly failing generated tests and also some that had an impact on the memory usage.
Test Plan: - run tests
Reviewed By: sryap
Differential Revision: D51601964
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114641
Approved by: https://github.com/williamwen42
_cslt_sparse_mm + additional stride checking in test.
Summary:
This PR adds in meta registrations for _cslt_sparse_mm.
Based on the work @drisspg did
in #114370.
Additionally, it updates the tests to check that the strides of the sparse result and the result returned by sparse+compile are the same, to avoid errors like those found in
https://github.com/pytorch/pytorch/pull/114477.
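For readers unfamiliar with meta registrations, here is a minimal sketch of the general mechanism using a hypothetical custom op (illustrative only; this PR registers a meta function for the existing `aten::_cslt_sparse_mm` overload, not for this op):
```
import torch

# Hypothetical library and op, for illustration.
lib = torch.library.Library("mylib", "DEF")
lib.define("my_mm(Tensor a, Tensor b) -> Tensor")

def my_mm_cpu(a, b):
    return a @ b

def my_mm_meta(a, b):
    # A meta kernel only computes output metadata (shape/dtype/device),
    # which is what torch.compile / fake tensors need in order to trace the op.
    return a.new_empty((a.shape[0], b.shape[1]))

lib.impl("my_mm", my_mm_cpu, "CPU")
lib.impl("my_mm", my_mm_meta, "Meta")

a = torch.randn(4, 8, device="meta")
b = torch.randn(8, 16, device="meta")
print(torch.ops.mylib.my_mm(a, b).shape)  # torch.Size([4, 16])
```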
Test Plan:
```
python test/test_sparse_semi_structured.py -k compile_cusparselt
python test/test_sparse_semi_structured.py -k compile_cutlass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg
I can hold off on reviews / landing until I talk to Driss and we confirm that we need this for FP8. This PR also needs testing and probably shouldn't land until Tugsuu's input mutation handling [PR](https://github.com/pytorch/pytorch/pull/111046) goes through.
What this PR tries to solve is the case where a model mutates some nn module state (a buffer) during the **backward** pass. It appears that this might be necessary for FP8's delayed scaling.
Today, AOTAutograd simply does not notice when graph inputs are mutated during the backward pass: it functionalizes the mutations away without recording that they were input mutations. This PR tries to:
(a) detect this situation (input mutations during the backward)
(b) put `copy_()`'s in the graph to properly handle the input mutation when we can. In cases where we can't keep the copy_() in the graph, we just error loudly (I imagine that these cases will be extremely rare, but we can fix them if they ever come up).
This is mostly a prototype for now, not ready for review.
I made this example locally to test out:
```
import torch

class MutatingAutogradFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x

    @staticmethod
    def backward(ctx, x_grad):
        buf = ctx.saved_tensors[0]
        buf.add_(x_grad)
        return x_grad * 3, None

class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.buf = torch.ones(2)

    @torch._dynamo.allow_in_graph
    def backward_mutating_fn(self, x, buf):
        return MutatingAutogradFn.apply(x, buf)

    def forward(self, x):
        tmp = self.backward_mutating_fn(x, self.buf)
        return tmp + self.buf

m = Mod()
x = torch.ones(2, requires_grad=True)
out = m(x)
# After the fw, buf should not have been mutated
print(m.buf)
out.sum().backward()
# bw has run, so buf should now be mutated
print(m.buf)
print(x.grad)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112906
Approved by: https://github.com/ezyang
Smuggle important and not too slow tests to run on this trunk job,
instead of just on the periodic job where they currently reside.
- test_dtensor_compile took 70 sec and test_fsdp_2d_parallel took 198 sec locally
As a follow-up, organize the distributed-mgpu tests better and maybe rename this job to reflect its more general 'dist mgpu' coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114642
Approved by: https://github.com/wanchaol, https://github.com/malfet
I'm looking to repurpose some logic in `torch.utils.collect_env` for the `geowatch` package. I'm mostly able to just use this script as a library, which is great because it reduces code in my package. However, the issue is that the package patterns that are relevant to torch are hard-coded inside of `get_conda_packages` and `get_pip_packages`.
The changes I made are simple. I defined the default package patterns as two global sets, and I added an argument to each function that lets the user customize exactly what package patterns are relevant. If they are not specified the defaults are used.
I was considering extending the power of the patterns by utilizing `fnmatch`, `re` (or [xdev.pattern](https://github.com/Erotemic/xdev/blob/main/xdev/patterns.py) which abstracts them both), but instead I opted to just use the existing `__contains__` test to keep things simple.
From torch's perspective this should make maintaining this file slightly easier: to update the relevant packages, a developer now edits two neighboring top-level globals instead of two separate local variables. It does add an argument to two functions that torch itself never passes, so there is an argument for removing it and letting users modify the globals instead, but I think the way I did it balances the tradeoffs well.
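A rough sketch of the pattern described above (the names below are illustrative, not the actual `collect_env` signatures):
```
# Hypothetical defaults; the real globals list the torch-relevant package patterns.
DEFAULT_PIP_PATTERNS = {"torch", "numpy", "mypy", "triton", "onnx"}

def filter_pip_packages(freeze_lines, patterns=None):
    """Keep only the `pip freeze` lines matching the relevant patterns."""
    if patterns is None:
        patterns = DEFAULT_PIP_PATTERNS
    # Plain substring (`__contains__`) test, as in the existing implementation.
    return [line for line in freeze_lines if any(p in line for p in patterns)]

# A downstream package could pass its own patterns:
# filter_pip_packages(lines, patterns=DEFAULT_PIP_PATTERNS | {"geowatch"})
```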
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112993
Approved by: https://github.com/zou3519
Updated version of #108885 addressing the review. In this PR:
- We add a VT.can_reconstruct utility that checks if VT.reconstruct() does something.
- If functools.wraps(fn) is passed a `fn` that either has a source or has .can_reconstruct() == True, then we stash the source (or the VT).
- Later on, we use the source (or VT.reconstruct) to actually reconstruct the object in codegen.
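A minimal sketch of the kind of code this is meant to support (illustrative, not taken from the new tests):
```
import functools
import torch

def add_one(fn):
    @functools.wraps(fn)  # fn is a nested user function, so its VT can be reconstructed
    def wrapper(x):
        return fn(x) + 1
    return wrapper

@torch.compile(backend="eager")
def f(x):
    def inner(y):
        return y.sin()
    return add_one(inner)(x)

print(f(torch.randn(3)))
```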
Test Plan:
- New tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114279
Approved by: https://github.com/voznesenskym
torch.split(x, l) fails when l contains unbacked SymInts.
E.g. l = y.tolist() makes l unbacked, because its values depend on the data of y. The downstream call `SliceView.create()` evaluates the shape even when the input shape is an unbacked SymInt, which triggers the bug.
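An illustrative repro of the failing pattern (assumes `capture_scalar_outputs` so that `.tolist()` stays in the graph; the real coverage is the inductor test below):
```
import torch

torch._dynamo.config.capture_scalar_outputs = True  # keep .tolist() in the graph

@torch.compile
def f(x, y):
    sizes = y.tolist()            # data-dependent values -> unbacked SymInts
    return torch.split(x, sizes)  # previously hit the SliceView.create() issue

x = torch.randn(10)
y = torch.tensor([3, 3, 4])
print([t.shape for t in f(x, y)])
```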
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
Add a TORCH_NCCL_DUMP_DEBUG_INFO env var to control dumping independently
of the desync debug feature.
It currently defaults to disabled (so there is no behavior change by default),
but the plan is to default it to true after validation.
Also moves the 'sleep for 30 sec' that used to happen after desync debug to before
it. In my view sleeping before desync is equivalent since we always sleep the same
duration, and it keeps the code simpler.
Fixes #114433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which can more accurately distinguish between data mutation vs. `set_()` calls vs. metadata mutation.
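As a toy illustration of why the two kinds of mutation need different runtime fix-ups (plain eager code, not AOTAutograd internals):
```
import torch

a = torch.zeros(4)
b = torch.ones(2)

# "Regular" metadata mutation: same storage, new sizes/strides/offset.
# This kind of mutation can be replayed at runtime with as_strided_().
a.as_strided_((2,), (1,), 2)

# Storage metadata mutation: a now aliases b's storage entirely.
# Replaying this requires set_(), not as_strided_().
a.set_(b)
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()
```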
**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations: a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output

def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.
When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.
Test Plan: CI
Reviewed By: larryliu0820
Differential Revision: D51534590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
C++ stacktrace processing (the symbolizer) takes a long time on some systems
using a particular version of addr2line. On slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.
TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.
CPP stacktrace is fast enough for use on certain combinations of OS. We
can investigate moving to llvm's symbolizer as a replacement.
On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s
TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
This adds some unit testing for the `ignored_states` argument and auto wrapping. There is some ongoing discussion with @erhoo82 about his particular use case, but it should not block this PR. (We can land a separate PR if needed.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114612
Approved by: https://github.com/wanchaol
ghstack dependencies: #114611
With this PR it is possible to differentiate through NumPy code, modulo
the usual caveats that apply to differentiation:
- there are no graph breaks
- the decomposition in `torch._numpy` is differentiable
@ev-br and I were somewhat careful to achieve the second point, but
it is not tested through and through, so YMMV
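A minimal sketch of what this enables (illustrative; assumes the whole function compiles without graph breaks):
```
import numpy as np
import torch

@torch.compile
def numpy_fn(x, y):
    # Inside the compiled region, NumPy calls are traced into torch._numpy,
    # so gradients can flow through them.
    return np.mean(np.multiply(x, y))

x = torch.randn(16, 8, requires_grad=True)
y = torch.randn(16, 8)
out = numpy_fn(x, y)
assert isinstance(out, torch.Tensor)
out.backward()
print(x.grad.shape)
```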
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114608
Approved by: https://github.com/voznesenskym
Changes:
1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register node type in both C++ and Python implementations.
6. Add tests to ensure a warning will be raised when the old private function is called.
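A sketch of the new unified registration API from item 5 (illustrative container type; the flatten/unflatten functions follow the existing pytree convention of producing/consuming `(children, context)`):
```
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, first, second):
        self.first = first
        self.second = second

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.first, p.second), None),       # flatten -> (children, context)
    lambda children, context: Pair(*children),   # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
print(leaves)                                     # [1, 2]
print(pytree.tree_unflatten(leaves, spec).first)  # 1
```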
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
As in the title.
As a result, `nn.linear(<strided tensor>, <BSR tensor>, bias=<strided tensor>)` performance increases as follows (`float16`, `NVIDIA A100-SXM4-80GB`):
- 256x256 weights, speed up is 14..27 %
- 512x512 weights, speed up is 9..25 %
- 1024x1024 weights, speed up is 5..20 %
- 2048x2048 weights, speed up is 3..16 %
- 4092x4092 weights, speed up is 2..9 %
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114484
Approved by: https://github.com/cpuhrsch