Commit Graph

66843 Commits

Author SHA1 Message Date
e891a3bba9 [releng] Add release 2.2 to Release Compatibility Matrix for PyTorch releases (#114758)
Update RELEASE.md for release 2.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114758
Approved by: https://github.com/DanilBaibak
2023-11-29 16:27:59 +00:00
4a4c9fb0b8 [ROCm] Add ROCm AMDGPU support for inductor cpp codegen (#105141)
Follows from previous enablement attempt: https://github.com/pytorch/pytorch/pull/101797

Adds support for hsaco binaries in inductor's cpp_wrapper codegen and enables the CUDA tests in test_cpp_wrapper.

This PR also brings in additional required hipify mappings for the wrapper codegen file.

NOTE: we can unskip some of these tests once MI210 runners are enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105141
Approved by: https://github.com/jansel, https://github.com/malfet
2023-11-29 15:11:24 +00:00
a3bbf9ce3e [BE][RelEng] Remove dynamo extra (#114720)
All dynamo dependencies are already part of the default requirements, so the `dynamo` extra can be removed; see
```
% curl -s https://pypi.org/pypi/torch/2.1.1/json | jq '.info.requires_dist'
[
  "filelock",
  "typing-extensions",
  "sympy",
  "networkx",
  "jinja2",
  "fsspec",
  "nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-curand-cu12 (==10.3.2.106) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nccl-cu12 (==2.18.1) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "nvidia-nvtx-cu12 (==12.1.105) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "triton (==2.1.0) ; platform_system == \"Linux\" and platform_machine == \"x86_64\"",
  "jinja2 ; extra == 'dynamo'",
  "opt-einsum (>=3.3) ; extra == 'opt-einsum'"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114720
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-11-29 15:08:27 +00:00
b6a30bbfb6 [Dynamo] Forward fix dynamo trace rule test failure due to landing race (#114739)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114739
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2023-11-29 09:31:12 +00:00
d2f4215dbb [quant][pt2e] Fix the order for implicit sharing code (#114704)
Summary:
The current order of implicit sharing breaks common SharedQuantizationSpec annotation patterns, so we changed the order here.
But it's not going to work in all possible annotation cases, so quantizer implementors still need to be careful.
In general, if people only refer to nodes/edges that come before the current node/edge in SharedQuantizationSpec, it should work.
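
For illustration, a hedged sketch of such an annotation, where the shared spec refers back to an edge annotated earlier (the `annotate_add` helper and `act_qspec` are illustrative, not from this PR):
```
# Hedged sketch: annotate the second input of an add so that it shares the
# quantization parameters of the (x, add_node) edge, which was annotated earlier.
from torch.ao.quantization.quantizer import (
    QuantizationAnnotation,
    SharedQuantizationSpec,
)

def annotate_add(add_node, x, y, act_qspec):
    shared = SharedQuantizationSpec((x, add_node))  # refers to an *earlier* edge
    add_node.meta["quantization_annotation"] = QuantizationAnnotation(
        input_qspec_map={x: act_qspec, y: shared},
        output_qspec=shared,
        _annotated=True,
    )
```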

Test Plan: CI; verified this fixes some internal tests

Differential Revision: D51605918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114704
Approved by: https://github.com/andrewor14
2023-11-29 08:58:28 +00:00
7692595834 Use different conv layout optimization heuristics for inference (#114600)
While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`.

 I used a modified script of the operator benchmarks [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench according to the existing classifications that @shunting314 used - grouped convs, small channel convs, convolution with larger in-channel than out-channel. Only grouped convolutions benchmarked as a slowdown in inference.

I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress.

Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%.

There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference.
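
For illustration, a minimal sketch of such a flop-weighted decision (the `Conv` record and the speedup factors are made up for the example, not the actual inductor heuristic):
```
from collections import namedtuple

Conv = namedtuple("Conv", ["flops", "grouped"])

def predicted_speedup(conv):
    # Illustrative factors: grouped convs benchmarked as a slowdown in inference.
    return 0.8 if conv.grouped else 1.2

def use_channels_last(convs):
    # Weight each conv's flops by its predicted channels-last speedup/slowdown.
    weighted = sum(c.flops * predicted_speedup(c) for c in convs)
    return weighted > sum(c.flops for c in convs)

print(use_channels_last([Conv(1e9, False), Conv(2e8, True)]))  # True
```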

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600
Approved by: https://github.com/jansel, https://github.com/shunting314
2023-11-29 07:53:59 +00:00
4e38178bb8 [Reland] [1/N] Fixes clang-tidy warnings in header files (#114668)
Reland of #113608 after fixing the problematic parts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114668
Approved by: https://github.com/huydhn
2023-11-29 07:11:51 +00:00
c10893654e [export] Fix run_decomps to work with fake mode (#114714)
Fixes https://github.com/pytorch/pytorch/issues/114711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114714
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-11-29 06:52:13 +00:00
a076a74f11 [Nested Tensor] Add xpu device in assertion for nested tensor creation (#114664)
Add xpu device checking in nested tensor creation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114664
Approved by: https://github.com/jgong5, https://github.com/xunnanxu
2023-11-29 05:59:35 +00:00
69c4819f53 Add bsr_dense_addmm triton kernel (#114595)
As in the title.

The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition to having input, beta, and alpha parameters):
- it implements a `SPLIT_N` kernel parameter that enables efficient kernel launches for wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and a 256x131072 strided input is reduced by about 16x (this corresponds to the 94% speed-up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights

The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 55/94%
- with 32x32 blocks, the average/maximal speed up is 33/63%
- with 64x64 blocks, the average/maximal speed up is 23/42%
- with 128x128 blocks, the average/maximal speed up is 15/39%
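
For reference, a rough sketch of the benchmarked scenario, nn.linear with BSR weights (shapes follow the 16x16-block row above; whether this path hits the new kernel depends on device, dtype, and build support):
```
import torch

# 256x256 fp16 weight converted to BSR with 16x16 blocks, 256x131072 strided input.
weight = torch.randn(256, 256, dtype=torch.float16, device="cuda")
bsr_weight = weight.to_sparse_bsr(blocksize=(16, 16))
x = torch.randn(131072, 256, dtype=torch.float16, device="cuda")
bias = torch.randn(256, dtype=torch.float16, device="cuda")

y = torch.nn.functional.linear(x, bsr_weight, bias)
```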

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
2023-11-29 05:29:25 +00:00
57a5a687b0 [Dynamo][6.2/N] Dump the in graph function list(~2600 ops) and add unit tests. (#114196)
This is the second PR according to https://github.com/pytorch/pytorch/pull/113009#issuecomment-1804417925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114196
Approved by: https://github.com/jansel
2023-11-29 05:09:48 +00:00
05f071d922 [export] Fix state dict device serialization (#114695)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/114000
Will check with SherlockNoMad, after his PTO, on why we need to convert to CPU

Test Plan: CI

Differential Revision: D51629068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114695
Approved by: https://github.com/ydwu4
2023-11-29 05:05:22 +00:00
7c8d3639cf Revert "[fx] log the node when it gets eliminated (#112684)"
This reverts commit 6256d3710e18f08af8588d1aae88c758bd9c6b30.

Reverted https://github.com/pytorch/pytorch/pull/112684 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112684#issuecomment-1831198778))
2023-11-29 04:31:15 +00:00
64ccdd4afb AOTAutograd: keep input mutations in the graph if they are under no_grad, even if they require_grad (#114646)
Quick recap of events:

(1) https://github.com/pytorch/pytorch/pull/111347, which fixed a perf regression in 2.1 compared to 2.0, introduced a correctness problem around input mutations on inputs that require grad that show up in an inference-only graph (the specific case where this can happen is rare and nobody reported the issue, but it was fixed a few weeks later)

(2) That fix happened here: https://github.com/pytorch/pytorch/pull/113584, which makes sure to keep input mutations outside of the graph, so the autograd engine can set metadata properly on them

(3) That in turn caused a slight regression compared to (1), which this PR attempts to fix. In particular, for code like the example below, it is safe to keep the mutations in the graph:

```
@torch.compile
def f(x):
    x.mul_(2)

x = torch.ones(2, requires_grad=True).clone()
# x requires_grad, so the input mutation will change some autograd metadata, like the version counter
# However, the mutation is under no_grad, so we don't have to worry about e.g. aliases of x having their .grad_fn fields changed
with torch.no_grad():
    f(x)
```

This particular case is pretty important to the shampoo optimizer code, which is run under `torch.compile`, and mutates parameters (which require grad).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114646
Approved by: https://github.com/zou3519
2023-11-29 04:29:32 +00:00
ce00c8fb45 [PyTorch] Remove hardcoded device=cuda in test_aot_inductor (#112797)
All the other tests use self.device, so this seems like an oversight? It cost me a lot of time debugging the minimal arrayref interface, which is only intended for CPU.

Differential Revision: [D50949928](https://our.internmc.facebook.com/intern/diff/D50949928/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112797
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
ghstack dependencies: #113997
2023-11-29 03:12:33 +00:00
5b9add666f [PyTorch] AOTI: Emit CACHED_TORCH_TYPE only as needed (#113997)
Avoids potential compatibility issues where a new dtype is supported by the DSO but not the binary loading it.

Differential Revision: [D51434335](https://our.internmc.facebook.com/intern/diff/D51434335/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113997
Approved by: https://github.com/int3
2023-11-29 03:12:32 +00:00
73a661abf1 Stop using excess memory in generate_opcheck_tests, re-enable fbgemm TBE tests (#114641)
Summary:
1. We stop using excess memory in generate_opcheck_tests. This is safe because
   all the individual test utils already ensure that they do not modify the
   inputs.
2. We re-enable the fbgemm TBE tests (see internal diff, but all of this is open
   source). They were previously removed because they OOM'ed when run serially;
   (1) and (3) cut down the memory usage to ~20gb peak.
3. I needed to skip some newly failing generated tests and also some that had an
   impact on the memory usage.

Test Plan: - run tests

Reviewed By: sryap

Differential Revision: D51601964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114641
Approved by: https://github.com/williamwen42
2023-11-29 02:21:13 +00:00
6256d3710e [fx] log the node when it gets eliminated (#112684)
Summary: ATT

Test Plan: CI

Reviewed By: strisunshinewentingwang

Differential Revision: D50912413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112684
Approved by: https://github.com/zyan0
2023-11-29 01:43:04 +00:00
24f06c7783 [no ci] Add .watchman to .gitignore (#114718)
Followup after https://github.com/pytorch/pytorch/pull/114716

TODO: should the old filename be deleted, or does it just depend on the Atom/VSCode version?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114718
Approved by: https://github.com/kit1980
2023-11-29 01:37:40 +00:00
48820c928c Revert "[test] AOTAutograd: support mutations on buffers that happen during the bw (#112906)"
This reverts commit c8974d649d684a33a5c02a0b112a6e0743201d97.

Reverted https://github.com/pytorch/pytorch/pull/112906 on behalf of https://github.com/huydhn due to There are lots of failures after this change c8974d649d, this is probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/112906#issuecomment-1831016362))
2023-11-29 00:49:57 +00:00
4bfb19827e Cleanup .watchman file (#114716)
This seems to be an artifact from an fb tool that snuck into a commit (#113117)? CC @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114716
Approved by: https://github.com/mikaylagawarecki, https://github.com/yanboliang, https://github.com/malfet
2023-11-29 00:48:58 +00:00
ae593d0393 [sparse][semi-structured][inductor] meta registrations for _cslt_sparse_mm + additional stride checking in test. (#114685)

Summary:

This PR adds in meta registrations for _cslt_sparse_mm.

Based on the work @drisspg did in #114370.

Additionally, it updates the tests by checking that the strides of the sparse result and the result returned by sparse+compile are the same, to avoid errors like those found in https://github.com/pytorch/pytorch/pull/114477.

Test Plan:
```
python test/test_sparse_semi_structured -k compile_cusparselt
python test/test_sparse_semi_structured -k compile_cutlass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg
2023-11-29 00:31:52 +00:00
43d0659d74 [C10D] Fix DUMP_ON_TIMEOUT env (#114699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114699
Approved by: https://github.com/kwen2501, https://github.com/XilunWu, https://github.com/fduwjj
2023-11-29 00:15:45 +00:00
bc34f02c38 [BE][Easy]: Apply RUF019: remove duplicate checks for dict access (#114478)
Applies RUF019 nightly preview rule to the codebase
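
For reference, a hedged example of the pattern RUF019 flags (not taken from the PR diff):
```
cfg = {"device": "cuda"}

# Flagged by RUF019: redundant membership check before dict access
if "device" in cfg and cfg["device"]:
    print(cfg["device"])

# Preferred: a single lookup
if device := cfg.get("device"):
    print(device)
```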
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114478
Approved by: https://github.com/mikaylagawarecki
2023-11-29 00:14:02 +00:00
c8974d649d [test] AOTAutograd: support mutations on buffers that happen during the bw (#112906)
I can hold off on reviews / landing until I talk to Driss and we confirm that we need this for FP8. This PR also needs testing and probably shouldn't land until Tugsuu's input mutation handling [PR](https://github.com/pytorch/pytorch/pull/111046) goes through.

What this PR tries to solve: a model that mutates some nn.Module state (a buffer) during the **backward** pass. It appears that this might be necessary for FP8's delayed scaling.

Today, AOTAutograd does not notice when graph inputs are mutated during the backward pass; it functionalizes the mutations away without recognizing them as input mutations. This PR tries to:

(a) detect this situation (input mutations during the backward)

(b) put `copy_()`'s in the graph to properly handle the input mutation when we can. In cases where we can't keep the copy_() in the graph, we just error loudly (I imagine that these cases will be extremely rare, but we can fix them if they ever come up).

This is mostly a prototype for now, not ready for review.

I made this example locally to test out:
```
import torch

class MutatingAutogradFn(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, buf):
        ctx.save_for_backward(buf)
        return x

    @staticmethod
    def backward(ctx, x_grad):
        buf = ctx.saved_tensors[0]
        buf.add_(x_grad)
        return x_grad * 3, None

class Mod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.buf = torch.ones(2)

    @torch._dynamo.allow_in_graph
    def backward_mutating_fn(self, x, buf):
        return MutatingAutogradFn.apply(x, buf)

    def forward(self, x):
        tmp = self.backward_mutating_fn(x, self.buf)
        return tmp + self.buf

m = Mod()

x = torch.ones(2, requires_grad=True)
out = m(x)
# After the fw, buf should not have been mutated
print(m.buf)
out.sum().backward()
# bw has run, so buf should now be mutated
print(m.buf)
print(x.grad)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112906
Approved by: https://github.com/ezyang
2023-11-28 23:59:21 +00:00
11277cc510 [CI] Remove an exception catching for Triton compiler error (#113064)
Summary: The workaround was there when Triton compiler was at its early stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113064
Approved by: https://github.com/eellison
2023-11-28 23:46:30 +00:00
3fccc0446c Add dtensor and fsdp/2d tests to inductor_distributed CI (#114642)
Smuggle important and not too slow tests to run on this trunk job,
instead of just on the periodic job where they currently reside.
 - test_dtensor_compile took 70sec, test_fsdp_2d_parallel took 198sec
   locally

As a follow-up, organize the distributed multi-GPU tests better and maybe
rename this job to reflect its more general 'dist mgpu' scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114642
Approved by: https://github.com/wanchaol, https://github.com/malfet
2023-11-28 23:06:18 +00:00
765d4599ee Give users control over packages in torch.utils.collect_env (#112993)
I'm looking to repurpose some logic in `torch.utils.collect_env` for the `geowatch` package. I'm mostly able to just use this script as a library, which is great because it reduces code in my package. However, the issue is that the package patterns that are relevant to torch are hard-coded inside of `get_conda_packages` and `get_pip_packages`.

The changes I made are simple. I defined the default package patterns as two global sets, and I added an argument to each function that lets the user customize exactly what package patterns are relevant. If they are not specified the defaults are used.

I was considering extending the power of the patterns by utilizing `fnmatch`, `re` (or [xdev.pattern](https://github.com/Erotemic/xdev/blob/main/xdev/patterns.py) which abstracts them both), but instead I opted to just use the existing `__contains__` test to keep things simple.

From torch's perspective this should make maintaining this file slightly easier: to update the relevant packages, the developer now updates two neighboring top-level globals instead of two separate local variables. However, it does add an argument to two functions that torch itself doesn't use, so there is an argument for removing it and letting users keep some control by modifying the globals instead; I think the way I did it balances the tradeoffs well.
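
For illustration, a hedged sketch of the downstream use case (the `patterns` keyword and the use of `run` as the run_lambda are assumptions about the new API):
```
# Hedged sketch: reuse collect_env's pip-listing helper with custom patterns.
from torch.utils.collect_env import get_pip_packages, run

# Only report pip packages whose names contain one of these substrings.
pip_version, packages = get_pip_packages(run, patterns={"torch", "numpy", "geowatch"})
print(pip_version)
print(packages)
```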
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112993
Approved by: https://github.com/zou3519
2023-11-28 22:35:25 +00:00
ce4bff4013 [dynamo] fix functools.wraps on nested functions (#114279)
Updated version of #108885 addressing the review. In this PR:
- We add a VT.can_reconstruct utility that checks if VT.reconstruct()
  does something.
- If functools.wraps(fn) is passed a `fn` that either has a source or
  has .can_reconstruct() == True, then we stash the source (or the VT)
- Later on, we use the source (or VT.reconstruct) to actually
  reconstruct the object in codegen.

Test Plan:
- New tests
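
For illustration, a hedged example of the kind of code this enables; it is my own toy example, not one of the new tests:
```
import functools
import torch

def with_doubling(fn):
    @functools.wraps(fn)  # wrapping a nested function used to fail to reconstruct
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs) * 2
    return wrapper

@torch.compile
def f(x):
    def inner(y):  # nested function: no global source to look up
        return y + 1
    return with_doubling(inner)(x)

print(f(torch.ones(3)))  # tensor([4., 4., 4.])
```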

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114279
Approved by: https://github.com/voznesenskym
2023-11-28 22:34:59 +00:00
a26d747615 [PyTorch][Vulkan] Fix matrix multiplication performance test binary (#114624)
Summary:
Due to recent changes in D51421256 and D51379737,
- shaders of `mm`, `addmm`, `bmm`, `baddbmm` are reduced into just `mm`,
- height and width packing logic is applied to linear operations

so the current perf tests of `addmm`, `create_linear_context`, and `run_linear_context` are no longer valid (0 latency will be printed; see test plan). Specifically, the original test extracts the latency of `vulkan.addmm`, which no longer exists. Instead, the current implementation of `addmm` invokes
```
vulkan.convert_channels_to_height_packed
vulkan.convert_channels_to_width_packed
vulkan.mm
vulkan.mul_scalar
vulkan.add
```
To deal with this:
- for `addmm` and `run_linear_context`, we apply a new function `extractTotalShaderResultsAndSetState` which aggregates the latency of all invoked shaders except `nchw_to_image` and `image_to_nchw`;
- for `create_linear_context`, besides `nchw_to_image` and `image_to_nchw`, we also aggregate `vulkan.convert_channels_to_height_packed`

Test Plan:
- build binary, at `fbsource`
```
buck2 build  -c ndk.debug_info_level=0  -c ndk.static_linking=true -c pt.enable_qpl=0 -c pt.vulkan_use_gpu_diagnostics=1 --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_mm_perf_test_binAndroid  --show-output  -c pt.vulkan_full_precision=1
```
- test on android device
```
adb push buck-out/v2/gen/fbsource/f1f3f9bed27e143c/xplat/caffe2/__pt_vulkan_mm_perf_test_binAndroid__/pt_vulkan_mm_perf_test_binAndroid /data/local/tmp
adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
```
## Before
addmm_benchmark
```
(base) luwei@luwei-mbp ~ % adb shell /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
2023-11-16T06:48:18+00:00
Running /data/local/tmp/pt_vulkan_mm_perf_test_binAndroid
Run on (4 X 1708.8 MHz CPU s)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
...
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4334408
vulkan.nchw_to_image     {500, 500, 1}                    4327648
vulkan.nchw_to_image     {500, 500, 1}                    4322760
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1233960
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1286896
vulkan.mm                {125, 125, 1}                   76186084
vulkan.mul_scalar        {500, 500, 1}                    1132924
vulkan.mul_scalar        {500, 500, 1}                    1128556
vulkan.add               {500, 500, 1}                    4285788
vulkan.image_to_nchw     {500, 500, 1}                    1421576
...
addmm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1                      0.000 ms         77.2 ms            5
```
create_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4336696
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1229384
...
create_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1       8.57 ms         32.9 ms            5
```
run_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4305548
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1196104
...
run_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1         0.000 ms         86.2 ms            5
```

## After
addmm_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4332016
vulkan.nchw_to_image     {500, 500, 1}                    4321356
vulkan.nchw_to_image     {500, 500, 1}                    4314908
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1195896
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1273428
vulkan.mm                {125, 125, 1}                   77055680
vulkan.mul_scalar        {500, 500, 1}                    1111708
vulkan.mul_scalar        {500, 500, 1}                    1111032
vulkan.add               {500, 500, 1}                    4236024
vulkan.image_to_nchw     {500, 500, 1}                    1429480
...
addmm_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1                       51.1 ms         76.0 ms            5
```
create_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4332432
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1235884
...
create_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1       9.74 ms         30.6 ms            5
```
run_linear_context_benchmark
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4289740
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1227928
...
run_linear_context_benchmark/N:500/M:500/P:500/iterations:5/manual_time/threads:1          50.4 ms         86.0 ms            5
```
full result in P887658084

Reviewed By: liuk22

Differential Revision: D51506293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114624
Approved by: https://github.com/yipjustin
2023-11-28 22:27:26 +00:00
d114f31b30 add testcase when bytecode hook changes the bytecode; fix code map (#114487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114487
Approved by: https://github.com/jansel
2023-11-28 22:14:57 +00:00
47e6cc4d22 Remove yet more type-ignores in dynamo/inductor (#114684)
Probably the last big batch for a while

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114684
Approved by: https://github.com/Skylion007
2023-11-28 22:09:38 +00:00
9f073ae304 [BE][Easy]: add some PLR pylint checks and exclusions to ruff (#114519)
Add a couple of additional checks and exclusions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114519
Approved by: https://github.com/jansel
2023-11-28 20:49:03 +00:00
74e10f0f60 [inductor] Fix torch.split bug on unbacked symint (#113406)
torch.split(x, l) fails when the split sizes in l are unbacked symints.

E.g. `l = y.tolist()` makes l unbacked, because l depends on the data of y. The downstream call `SliceView.create()` evaluates the shape even if the input shape is an unbacked symint, which triggers the bug.
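
For illustration, a hedged repro sketch of the pattern (mirroring the test below; the `capture_scalar_outputs` flag is an assumption about what is needed to reach unbacked symints):
```
import torch

torch._dynamo.config.capture_scalar_outputs = True  # so .tolist() produces unbacked symints

@torch.compile
def f(x, sizes):
    lengths = sizes.tolist()        # data-dependent -> unbacked symints
    return torch.split(x, lengths)  # previously evaluated the unbacked shape and failed

out = f(torch.randn(10, 4), torch.tensor([3, 3, 4]))
print([t.shape for t in out])
```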

Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2023-11-28 20:45:13 +00:00
4aa2c51a09 [doc] fix typo on graph 3 that is recorded (#114666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114666
Approved by: https://github.com/eellison
2023-11-28 20:40:13 +00:00
4a35ec3c0e [docs] correct the code for cudagraph trees integration (#114583)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114583
Approved by: https://github.com/eellison
2023-11-28 20:28:52 +00:00
44c9e4cbf0 [C10D] Decouple PGNCCL desync from dbg dump (#114614)
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently
of desync debug feature.

This currently defaults to disabled (so no behavior change by default),
but we plan to default it to true after validation.

Moves the 'sleep for 30 sec' that used to be after desync debug to before
it. In my view, sleeping before desync is equivalent since we always
sleep the same duration, and it keeps the code simpler.

Fixes #114433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
2023-11-28 19:46:10 +00:00
cef79c0df4 [inductor] _sparse_semi_structured_linear fallback - no meta registration; not on testing path (#114477)
The test was wrong in the original PR, and the merged changes were never tested. Further, the sparse op was never actually compiled, due to a missing `fullgraph=True` and a missing meta registration.

When the meta is added as in this PR, it gives wrong answers when the input needs to be padded and when the input needs to be reshaped.

Is this something to do with the generated inductor code for:
```
 constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
...
slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
```
and

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
...
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
```

Failing graphs:
Padded:
```
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  ===== Forward graph 5 =====
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.66 class GraphModule(torch.nn.Module):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[1, 128]"):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[32, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(constant_pad_nd, primals_1, primals_2);  constant_pad_nd = primals_1 = primals_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_2: "f16[1, 128]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[1, 128]" = torch.ops.aten.relu.default(slice_2);  slice_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         permute: "f16[128, 1]" = torch.ops.aten.permute.default(primals_3, [1, 0]);  primals_3 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, le, permute]

```

Reshape:

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.69 class GraphModule(torch.nn.Module):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[128]", primals_4: "Sym(s0)", primals_5: "Sym(s1)", primals_6: "f16[s0, s1, 128]"):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[s0*s1, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(view, primals_1, primals_2, bias = primals_3);  primals_1 = primals_2 = primals_3 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[s0*s1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 1, 0, 9223372036854775807);  _sparse_semi_structured_linear = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[s0, s1, 128]" = torch.ops.aten.relu.default(view_1);  view_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[s0, s1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, view, le, primals_4, primals_5]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114477
Approved by: https://github.com/jcaip
2023-11-28 19:35:05 +00:00
ddf1cb7870 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-28 19:33:35 +00:00
e83c05c833 [ONNX] Add ONNX ExportedProgram tests (#114633)
Fix #114166
Fix #113705

This PR references tests from `test_export.py` to make sure the exported program from PyTorch can all be successfully exported into ONNX model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114633
Approved by: https://github.com/thiagocrepaldi
2023-11-28 19:03:13 +00:00
39f16c221e Adding event_tracer evalue logging calls in codegen (#114584)
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.

When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D51534590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
2023-11-28 18:32:05 +00:00
e6a8052051 [C10D] Flight recorder - disable c++ stacktrace by default (#114651)
C++ stacktrace processing (the symbolizer) takes a long time on some systems
using a particular version of addr2line. On slow systems, this makes
flight-recorder dumping slow enough to time out on even toy programs.

TORCH_NCCL_TRACE_CPP_STACK=True will re-enable CPP stacktrace collection
as part of the flight recorder.

CPP stacktrace is fast enough for use on certain combinations of OS. We
can investigate moving to llvm's symbolizer as a replacement.

On devserver with C++ stacktraces disabled/enabled:
```
python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s

TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114651
Approved by: https://github.com/zdevito
2023-11-28 16:49:20 +00:00
b060694088 Add bits dtypes to torch._C stubs (#114661)
As defined in 6ae0554d11/c10/core/ScalarType.h (L54-L58)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114661
Approved by: https://github.com/ngimel
2023-11-28 15:21:58 +00:00
0bef97fac3 [dynamo] Support itertools.groupby (#114192)
Summary: for https://github.com/pytorch/pytorch/issues/108698
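
For illustration, a hedged example of the newly supported pattern (my own toy example, not from the linked issue):
```
import itertools
import torch

@torch.compile
def f(x, keys):
    out = x
    for key, group in itertools.groupby(keys):
        out = out + key * len(list(group))
    return out

print(f(torch.zeros(3), [1, 1, 2, 3, 3, 3]))  # tensor([13., 13., 13.])
```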

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114192
Approved by: https://github.com/jansel
2023-11-28 14:58:59 +00:00
cc7a969bb3 [FSDP] Added test for ignored_states + auto wrap (#114612)
This adds some unit testing for the `ignored_states` argument and auto wrapping. There is some ongoing discussion with @erhoo82 about his particular use case, but it should not block this PR. (We can land a separate PR if needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114612
Approved by: https://github.com/wanchaol
ghstack dependencies: #114611
2023-11-28 14:36:34 +00:00
79ee99e6d2 [easy] Dispatch torch.from_numpy to torch.as_tensor (#114609)
...rather than detaching the tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114609
Approved by: https://github.com/larryliu0820, https://github.com/voznesenskym
ghstack dependencies: #114608
2023-11-28 12:04:37 +00:00
0bb2600c28 Allow to differentiate through NumPy code (#114608)
With this PR it is possible to differentiate through NumPy code modulo
the usual caveats that apply to differentiation:
- That there are no graphbreaks
- That the decomposition in `torch._numpy` is differentiable

@ev-br and I were somewhat careful to achieve the second point, but
it is not tested through and through, so YMMV
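
For illustration, a hedged sketch of what this enables (assuming the region does not graph break, per the first caveat; a break would fall back to eager, where `.numpy()` on a tensor that requires grad errors):
```
import numpy as np
import torch

@torch.compile
def f(x):
    y = np.sin(x.numpy()) * 2.0    # NumPy code traced through torch._numpy
    return torch.from_numpy(y).sum()

x = torch.randn(4, requires_grad=True)
f(x).backward()                    # gradients flow through the NumPy section
print(x.grad)                      # ~ 2 * cos(x)
```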

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114608
Approved by: https://github.com/voznesenskym
2023-11-28 12:04:37 +00:00
89a1fe6966 [pytree] register pytree node type in both C++ pytree and Python pytree (#112111)
Changes:

1. Add `_private_register_pytree_node` API in both C++ and Python pytree. In C++ pytree, the API will only register pytree node for C++ pytree. In Python pytree, the API will only register pytree node for Python pytree.
2. Do not allow registering a type as pytree node twice in the Python pytree.
3. Add thread lock to the Python pytree node register API.
4. The old `_register_pytree_node` API will call the `_private_register_pytree_node` API and raise a deprecation warning.
5. Add a new `register_pytree_node` API to register node type in both C++ and Python implementations.
6. Add tests to ensure a warning will be raised when the old private function is called.
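
For illustration, a hedged sketch of the new unified API (the `Point` class is made up; see `torch.utils._pytree` for the exact signature):
```
import torch.utils._pytree as pytree

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Register Point in both the Python and C++ pytree implementations.
pytree.register_pytree_node(
    Point,
    lambda p: ((p.x, p.y), None),            # flatten: (children, context)
    lambda children, ctx: Point(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Point(1, 2))
print(leaves)  # [1, 2]
```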

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112111
Approved by: https://github.com/zou3519
2023-11-28 11:41:38 +00:00
088fc7779e Eliminate unnecessary copy in CUDA addmm with sparse compressed block operand (#114484)
As in the title.

As a result, `nn.linear(<strided tensor>, <BSR tensor>, bias=<strided tensor>)` performance increases as follows (`float16`, `NVIDIA A100-SXM4-80GB`):
- 256x256 weights, speed up is 14..27 %
- 512x512 weights, speed up is 9..25 %
- 1024x1024 weights, speed up is 5..20 %
- 2048x2048 weights, speed up is 3..16 %
- 4092x4092 weights, speed up is 2..9 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114484
Approved by: https://github.com/cpuhrsch
2023-11-28 11:35:55 +00:00
00412e6dfa [export] Add meta to params (#114622)
The graph from `capture_pre_autograd_graph` doesn't have `meta["val"]` on the param nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114622
Approved by: https://github.com/frank-wei, https://github.com/zhxchen17, https://github.com/khabinov
2023-11-28 07:40:15 +00:00