Compare commits

...

2391 Commits

Author SHA1 Message Date
bd14a05729 [dynamo] Allow inlining of hooks for the top module
ghstack-source-id: 51408faf9d8b5f054544107f38316f2ccf1f7a3a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124501
2024-05-10 10:04:37 -07:00
7af546f53f [wip][inductor] Fix batch fusion pass
ghstack-source-id: e6872d4b64bf35d3cfe98cf816b8eaab983fc256
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125935
2024-05-10 10:04:37 -07:00
5f58cf65d1 Refactor other post compile wrappers in forward functions (#125610)
This continues the refactor by creating CompilerWrappers for FakifiedOut, RngFunctionalization, and aot_dispatch_subclass_wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125610
Approved by: https://github.com/bdhirsh
2024-05-10 17:01:40 +00:00
cc4da72b47 [caffe2] Make all get_backtrace() implementations lazy (#125750)
Summary: #125682 (D56586844) added support for lazy symbolization to `Error` and adopted it for internal use cases; this commit adopts it for `get_backtrace()` as well.

Test Plan: Sandcastle and GH CI.

Differential Revision: D56881683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125750
Approved by: https://github.com/ezyang
2024-05-10 16:02:40 +00:00
31372fa842 Support generic stream/event on CUDA/HIP backend (#125757)
# Motivation
According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on CUDA backend.

# Additional Context
new method/attribute on `torch.Event` for cuda
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

new method on `c10::Event` on cuda backend
- c10.Event.event_id
- c10.Event.elapsed_time
- c10.Event.synchronize

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125757
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/EikanWang
2024-05-10 13:34:09 +00:00
946b96fd54 [AOTI] Add a failing test case (#123235)
Summary: Reported in https://github.com/pytorch/pytorch/issues/123210, and can reproduced in AOTI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123235
Approved by: https://github.com/chenyang78
2024-05-10 11:22:06 +00:00
f87fbfdb01 GPT-fast benchmark: remove Embedding layer from model size (#125901)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125901
Approved by: https://github.com/Chillee
2024-05-10 08:18:13 +00:00
d81db9c1df GitHub workflows / Dynamic rollout (#125680)
This PR introduces a tool to dynamically switch between ARC runners and old runners without having to update the PR to the latest version.

There is also a third option - use both runners at the same time (aka shadow deployment). In this case, failed workflows using ARC launchers will not block merge process.

The GitHub issue is used to control access to ARC launchers - [Access Rules Issue](https://github.com/pytorch/test-infra/issues/5132):

* In the FIRST comment you can specify who will use the ARC runners:
   * Add a GitHub username to use ARC runners.
   * Add "*" at the beginning to switch ALL users to ARC runners.
   * Add "!" at the beginning to switch ALL users to old runners.
* In the SECOND comment you can specify do we need to run ARC runners and old runners at the same time.
   * To use both runners, add a second comment with the word "both".
   * If we want to use only one type of runners, just remove the second comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125680
Approved by: https://github.com/ZainRizvi
2024-05-10 07:32:10 +00:00
cyy
2ed17e0b1e Remove binaries using caffe2 functionality (#125885)
This PR removed some binaries using deleted or to be deleted Caffe2 functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125885
Approved by: https://github.com/r-barnes, https://github.com/Chillee
2024-05-10 06:21:10 +00:00
013722bcb8 Allow symbols to reach conv_layout stride argument (#125829)
#Fix https://github.com/pytorch/pytorch/issues/125638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125829
Approved by: https://github.com/anijain2305
2024-05-10 06:13:36 +00:00
fcbf2b61e6 Memoize local_scalar_dense calls, refactor all memos (#125623)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125623
Approved by: https://github.com/eellison
2024-05-10 01:52:55 +00:00
8be4104cf3 Update conda to latest version for Docker release builds (#125887)
Fixes https://github.com/pytorch/pytorch/issues/125879

Issue is somewhat similar to this issue: https://github.com/pytorch/pytorch/issues/106470
doing:
```
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
```
pulls cpu version of pytorch,vision and audio here: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24795924934#step:11:6849
```
16 37.21     mpmath-1.2.1               |          py311_0         1.2 MB  pytorch-nightly
#16 37.21     nettle-3.7.3               |       hbbd107a_1         809 KB
#16 37.21     networkx-3.1               |  py311h06a4308_0         3.3 MB
#16 37.21     openh264-2.1.1             |       h4ff587b_0         711 KB
#16 37.21     pillow-9.3.0               |  py311h3fd9d12_2         874 KB  pytorch-nightly
#16 37.21     pytorch-2.4.0.dev20240509  |     py3.11_cpu_0        87.1 MB  pytorch-nightly
#16 37.21     pytorch-cuda-12.1          |       ha16c6d3_6           7 KB  pytorch-nightly
#16 37.21     pytorch-mutex-1.0          |              cpu           3 KB  pytorch-nightly
#16 37.21     sympy-1.12                 |  py311h06a4308_0        14.4 MB
#16 37.21     torchaudio-2.2.0.dev20240509|        py311_cpu         5.1 MB  pytorch-nightly
#16 37.21     torchvision-0.19.0.dev20240509|        py311_cpu         7.3 MB  pytorch-nightly
```
Updating conda to latest and rebuilding solved this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125887
Approved by: https://github.com/huydhn
2024-05-10 01:43:59 +00:00
d14d6127f6 [BE] Rename macos-12 to macos-13/macos- jobs (#125859)
As CI does not have any MacOS 12 runners anymore
Cleanup any misleading references about cross-compilation as M1 builds are done natively for quite some time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125859
Approved by: https://github.com/ZainRizvi
2024-05-10 01:30:29 +00:00
2ad794550a Support generic stream/event on XPU backend (#125751)
# Motivation
According to [#123611](https://github.com/pytorch/pytorch/pull/123611), we support generic stream/event on XPU backend.

# Additional Context
new method/attribute on `torch.Event` for xpu
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

new method on `c10::Event` on xpu backend
- c10.Event.event_id
- c10.Event.elapsed_time
- c10.Event.synchronize

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125751
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-05-10 01:27:30 +00:00
d19d932183 update pointwise cat heuristics (#125772)
Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:

- fusing into a pointwise use
- horizontally fusing copy_ kernels

The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
2024-05-10 01:07:39 +00:00
978b572652 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function.

Currently, this PR always fallback to python kernel now and cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368

Differential Revision: [D57164385](https://our.internmc.facebook.com/intern/diff/D57164385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-05-10 00:30:27 +00:00
c9a258e474 [export] handle constant aliasing for export (#125509)
Summary: Currently export will [error out](2b5ae2611e/torch/export/_trace.py (L477)) if a constant is aliased. This PR supports this by modifying ConstantAttrMap to map constants to a list of FQNs instead of a single FQN, populating the ExportedProgram constants dict to contain multiple entries to the same constant.

Test Plan: added test case in test_export.py

Differential Revision: D56955654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125509
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-05-10 00:14:37 +00:00
fd816bf630 Add script for removing Inductor dependencies from Inductor generated code (#125811)
Usage:
```python
TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python foo.py
TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1 python /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py
python remove_inductor_deps.py /tmp/torchinductor_chilli/js/cjsbczkf6fj36nhaxxypll6cy4fmwmkoauklrgrvuody2mn7oeef.py
```

Example generated code: https://pastebin.com/m6Ae8heB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125811
Approved by: https://github.com/chenyang78
2024-05-10 00:00:25 +00:00
3267814d53 [inductor] refactor: device dispatch inside do_bench (#125736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736
Approved by: https://github.com/shunting314
2024-05-09 23:50:02 +00:00
13545fe68a [export] Don't create a new fake mode if dynamo tracing (#125185)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125185
Approved by: https://github.com/mikekgfb
2024-05-09 23:43:08 +00:00
23e71ffd82 Remove unused caffe2 subdirs (#125818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125818
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-05-09 22:57:55 +00:00
350a3ed82f Fix unused variable 'kEps' (#125870)
Summary:
> fbcode/caffe2/caffe2/utils/math_gpu_test.cc:227:17: error: unused variable 'kEps' [-Werror,-Wunused-const-variable]

See https://www.internalfb.com/intern/test/844425000398735?ref_report_id=0

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan: Sandcastle run

Reviewed By: r-barnes

Differential Revision: D56731004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125870
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-05-09 22:57:37 +00:00
477612c0f6 [dynamo] Clear GenerationTracker on dynamo reset (#125855)
Fixes https://github.com/pytorch/pytorch/issues/125567

Not doing this causes modules to be unspecialized when tests run in sequence, and specialized when run alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125855
Approved by: https://github.com/jansel
2024-05-09 22:47:54 +00:00
52fad83335 [onnx.export] Avoid linear look up in env for exist_in_env (#124909)
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.

- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and this look-up costs in proportional to the graph size, which incurs and overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure to make this de facto constant time. So, this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.

Partially fixes #121422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
2024-05-09 22:38:00 +00:00
37d2ecd123 Only log toplevel torchscript calls. (#125714)
Summary: as title.

Test Plan: CI

Reviewed By: gmagogsfm

Differential Revision: D57069719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125714
Approved by: https://github.com/SherlockNoMad
2024-05-09 22:29:53 +00:00
e43d656921 FakeTensor speedup: minor cleanups (#124224)
A few cleanup tasks that didn't really fit into the other diffs in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124224
Approved by: https://github.com/oulgen
ghstack dependencies: #122911, #124223
2024-05-09 22:11:51 +00:00
a08be4b705 FakeTensor speedup: Split cache_key so we only validate once (#124223)
When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as one with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute the cache only validate the args when we first see a new cache key.

18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223
Approved by: https://github.com/oulgen, https://github.com/masnesral
ghstack dependencies: #122911
2024-05-09 22:11:51 +00:00
6a8b1da18d FakeTensor speedup: Delay formatting stack trace until it's actually asked for. (#122911)
When constructing a `FakeTensorMode`, instead of immediately formatting a full stack trace, grab the traceback and only format it on demand.

4.2% FakeTensor perf win on the microbenchmark.

```
import time
import torch
import torch._dynamo as dynamo
from torch._subclasses.fake_tensor import FakeTensorMode
import numpy as np

def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    b = b * -1
    return x * b

def run_test1():
    dynamo.reset()
    j = [1, 2, 3]
    toy_example(torch.randn(j), torch.randn(j))

def run_test2():
    dynamo.reset()
    j = [1, 2, 3]
    with FakeTensorMode():
        toy_example(torch.randn(j), torch.randn(j))

ITERATIONS = 500000
FORMAT_STRING = "{name:12}: TOT: {tot:10.3f}, AVG: {avg:10.3f}, MIN: {min:10.3f}, P50: {p50:10.3f}, P90: {p90:10.3f}, P99: {p99:10.3f}"

def run_tests(name, step):
    step()
    timings = []
    start = time.time()
    for i in range(ITERATIONS):
        a = time.perf_counter_ns()
        step()
        b = time.perf_counter_ns()
        timings.append(b - a)
    end = time.time()
    fmt = {
        "best": min(timings),
        "tot": end - start,
        "avg": np.average(timings),
        "min": min(timings),
        "p50": np.percentile(timings, 50),
        "p90": np.percentile(timings, 90),
        "p99": np.percentile(timings, 99)
    }
    print(FORMAT_STRING.format(name=name, **fmt))
    return fmt

ts = run_tests("tensor", run_test1)
fs = run_tests("fake tensor", run_test2)
ratio = {k: a / b for ((k, a), (_, b)) in zip(fs.items(), ts.items())}
print(FORMAT_STRING.format(name="ratio", **ratio))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122911
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-05-09 22:11:51 +00:00
eaaf0f3299 Print capture_pre_autograd_graph warning only once (#125848)
Summary: Print this warning only once to avoid flooding the logs of workflows where this is called frequently.

Test Plan: CI

Differential Revision: D57163341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125848
Approved by: https://github.com/zhxchen17
2024-05-09 22:04:05 +00:00
20271f0a3b Drop caffe2-linux-jammy-py3_8-gcc11-build (#125857)
Removes more caffe2 testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125857
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-05-09 21:52:27 +00:00
ae5e2ab92e [dynamo][fsdp] Use Tensor match for FSDP modules (#125827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125827
Approved by: https://github.com/yf225, https://github.com/jansel
ghstack dependencies: #125828, #125805
2024-05-09 21:26:15 +00:00
0d4fdb0bb7 Revert "[ROCm] amdsmi library integration (#119182)"
This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57.

Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))
2024-05-09 21:18:21 +00:00
966ebd2e24 Add --warm-start-latency to benchmark harness (#125353)
Summary: This change introduces a new flagg to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e. a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-05-09 21:12:15 +00:00
ee00349780 [dynamo][logs] move recompilation reason within compile_id scope (#125805)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125805
Approved by: https://github.com/ezyang
ghstack dependencies: #125828
2024-05-09 20:37:23 +00:00
a7575e8bd5 [dynamo] Use correct source for custom getattr (#125828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125828
Approved by: https://github.com/williamwen42
2024-05-09 20:37:23 +00:00
7c00635125 [CI] Move gha artifact download before xml parsing for test stat uploads (#125609)
Move gha artifact download to before any xml parsing is done for uplaod-test-stats

Do not download gha artifacts during xml parsing since got uploaded to s3 in the above and will be downloaded when all the artifacts are downloaded from s3

The previous method resulted in dups if you run the script again

TODO: write a deduper so we don't have to worry at all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125609
Approved by: https://github.com/huydhn
2024-05-09 20:35:09 +00:00
1ecea513b6 Fix common_methods_invocations example inputs to _efficient_attention_forward (#125788)
Fixes #120693

this tries to fix the sample input in common_methods_invocations.py:
* I think the arange was intended to be skipping every other integer in the range. Previously, we'd have one length that was -1.
* k, v tensors were too small - updated the sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125788
Approved by: https://github.com/drisspg, https://github.com/Aidyn-A
2024-05-09 20:08:49 +00:00
6fd745255e Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit 3f36145db298f7305b3b4df6c82c9101025a049a.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))
2024-05-09 19:52:45 +00:00
74a0ef8f8c Enable UFMT format on test/test_package.py test/test_per_overload_api.py (#125834)
Fixes some files in https://github.com/pytorch/pytorch/issues/123062

Run lintrunner on files:
test/test_package.py
test/test_per_overload_api.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125834
Approved by: https://github.com/malfet
2024-05-09 19:48:22 +00:00
ed8a560845 Update Release Calendar for 2.3.1 and 2.4 releases (#125794)
As per:
- https://dev-discuss.pytorch.org/t/pytorch-release-2-4-0-call-for-features/2051
- https://dev-discuss.pytorch.org/t/pytorch-release-2-3-1-planning/2052

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125794
Approved by: https://github.com/malfet
2024-05-09 18:31:52 +00:00
85447c41e3 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
0e419b9146 Fix graph partitioner and make runtime assertion work with submodules in export (#125793)
Summary: This fix does three things:

1. When we add inputs from partioner to the top level graph module, we insert in the order of partioner which is not guaranteed to be same as original graph inputs. This PR fixes that.
2. When we replace autograd ops with HOP, we create new submodules and access their outputs via getitem calls. As a result, previous node names associated with getitem gets updated, resulting in the graph being different from produced graph signature. So I just update the graph signature accordingly.
3. We run runtime_assertion pass before autograd HOP pass because the constraints won't be populated correctly.

Differential Revision: [D57130314](https://our.internmc.facebook.com/intern/diff/D57130314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125793
Approved by: https://github.com/zhxchen17
2024-05-09 18:13:46 +00:00
98821b3d92 Disable various flaky tests in test_foreach (#125783)
* Similar to #125046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783
Approved by: https://github.com/huydhn
2024-05-09 18:08:39 +00:00
ae20f15941 [dynamo] trace through nn parametrize (#125771)
Fix https://github.com/pytorch/pytorch/issues/120914

Example dynamo output graph (from test_nn_parametrize):
```
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code] TRACED GRAPH
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]  /data/users/williamwen/pytorch2/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]     def forward(self, L_x_: "f32[10, 10]"):
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         l_x_ = L_x_
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/torch/nn/utils/parametrize.py:275 in forward, code: x = self[0](self.original)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         l__self___parametrizations__param___original: "f32[10, 10]" = self.L__self___parametrizations__param___original
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4759 in forward, code: return torch.sin(x)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         x: "f32[10, 10]" = torch.sin(l__self___parametrizations__param___original);  l__self___parametrizations__param___original = None
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         # File: /data/users/williamwen/pytorch2/test/dynamo/test_repros.py:4755 in forward, code: return self.param @ x
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         matmul: "f32[10, 10]" = x @ l_x_;  x = l_x_ = None
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]         return (matmul,)
V0508 11:16:26.687000 140092517021504 torch/_dynamo/output_graph.py:1272] [0/0] [__graph_code]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125771
Approved by: https://github.com/jbschlosser
ghstack dependencies: #125710, #125724
2024-05-09 17:43:48 +00:00
6ea226b99c Fix DDP no_sync when find_unused_parameters is True (#124193)
Fixes #69031, #42793

This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.

Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.

Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
2024-05-09 17:33:33 +00:00
8fb3ff2a4e Revert "[profiler] enable CUPTI range profiler in build (#125685)"
This reverts commit 2deea9e6e9faf5eacebefa2336861d129c598c99.

Reverted https://github.com/pytorch/pytorch/pull/125685 on behalf of https://github.com/atalman due to Broke nightly ([comment](https://github.com/pytorch/pytorch/pull/125685#issuecomment-2103093237))
2024-05-09 17:28:02 +00:00
26b942c4fc [C10D] Document destroy_process_group usage (#122358)
This API was not documented. It has already been a source of confusion,
but recently has become more urgent as improper destruction can lead to
hangs due to ncclCommAbort's requirement of being called collectively.
<img width="888" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9">

Fixes #48203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358
Approved by: https://github.com/shuqiangzhang
2024-05-09 16:51:31 +00:00
257d40ba2e Docker release - push nightly tags only for amd64 builds (#125845)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/9014006158/job/24765880791#step:12:43
```
Unable to find image 'ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240509-runtime' locally
2.4.0.dev20240509-runtime: Pulling from pytorch/pytorch-nightly
docker: no matching manifest for linux/amd64 in the manifest list entries.
```
This cpu image does not exist for amd64 and not uploaded to dockerhub. Hence don't tag it .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125845
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-05-09 16:42:15 +00:00
3ccf107f01 [export] remove upgrader. (#125625)
Summary: talked to executorch team, seems we can remove this now.

Test Plan: CI

Differential Revision: D57013451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125625
Approved by: https://github.com/larryliu0820
2024-05-09 16:30:12 +00:00
0241ed9331 Fix sparse fake tensors detach (#125679)
As in the title.

Fixes a bug reported in https://github.com/pytorch/pytorch/pull/117907#discussion_r1589581536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125679
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-05-09 15:40:57 +00:00
7e86a7c015 Lint: Update older-python test to 3.6 (#125843)
As python-3.5 can no longer connect to pypi after today's cert update
Fixes https://github.com/pytorch/pytorch/issues/125841
2024-05-09 07:23:59 -07:00
b8a706a321 [EZ][BE] Use untyped_storage in tests (#125838)
Get's rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if base.storage().data_ptr() != other.storage().data_ptr():
```

(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )

Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
2024-05-09 14:04:21 +00:00
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those ones are running 14.4.1, so I wonder if they actually pass CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
b9588101c4 [Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)
**Summary**
Fix 2 regression issues caused by previous refactor:

- Fix the issue in dequant promotion pass with dynamic quant when the dequant node is with `tensor` overload.
- Fix numerical issue in dynamic quant, since input will convert to scales' dtype (which is `double`) to do quant operatoration with previous implementation.

**TestPlan**
```
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207
Approved by: https://github.com/peterbell10, https://github.com/jgong5
ghstack dependencies: #124041, #124246
2024-05-09 08:47:24 +00:00
c337395cdb [Inductor][Quant] Change the QConv output scale name (#124246)
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 08:44:00 +00:00
d83ab88f81 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, we can move the optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase to avoid the changes. In this way, we can:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no bc-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 08:40:44 +00:00
96c8447001 change error message to avoid failing when nn modules inlined (#125612)
#address https://github.com/pytorch/pytorch/issues/125605

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125612
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2024-05-09 08:34:31 +00:00
da2f4bbc33 remove empty partition (#124920)
In some rare scenarios, the partitioner will produce an empty partition. it's a waste of time to compile an empty graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124920
Approved by: https://github.com/ezyang
2024-05-09 07:39:47 +00:00
e5766f02d0 [onnx.export] Avoid dict <-> unordered_map implicit copies (#123063)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- Avoid [implicit copy](https://pybind11.readthedocs.io/en/stable/advanced/cast/stl.html#automatic-conversion) between `pybind11::dict` and `std::unordered_map` that
  happens for every node that gets processed. The copy scales with N
  (number of nodes), so this creates a quadratic time complexity.
  Solution is to always use `pybind11::dict`.
- This alone speeds up exports by x2 for large models.
- Resolves (1) in #121422.

(partial fix of #121422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123063
Approved by: https://github.com/justinchuby
2024-05-09 07:34:47 +00:00
c59a2369be [fsdp2] Accomodate FSDP2 to accept parent mesh > 2 (#125778)
as titled, to support higher dimension parallelism

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125778
Approved by: https://github.com/weifengpy
2024-05-09 05:02:21 +00:00
aaa2f93a4f Add meta for _embedding_bag_dense_backward and _embedding_bag_per_sample_weights_backward (#125785)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125785
Approved by: https://github.com/albanD
2024-05-09 04:28:16 +00:00
ed48ea9997 [AOTI] Refine the C shim autogen mechanism (#125589)
Summary: Based on the discussions in https://github.com/pytorch/pytorch/pull/120513. Instead of auto-generate C shim fallback ops for thousands of ops, we maintain a list of fallback ops based on torch/_inductor/lowering.py, and only generate C shim functions for those ops. At the torchgen time, we will re-generate C shim files and compare the header file contents against the existing C shim headers. If there is any change, the compilation will fail with prompt on how to proceed. This makes sure the ABI-compatible C shim layer is small enough to maintain in the long run.

Differential Revision: [D57004046](https://our.internmc.facebook.com/intern/diff/D57004046)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125589
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/albanD, https://github.com/ezyang
2024-05-09 02:48:16 +00:00
0bde9c08ef Prevent rendezvous shutdown on worker restarts (#124819)
Fixes #123678

#### Summary
When the rank leaves and joins back, the workers are restarted and while restarting the rendezvous is shut down. This change prevents rendezvous shutdown during worker restarts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124819
Approved by: https://github.com/malfet, https://github.com/kurman, https://github.com/eqy
2024-05-09 02:40:31 +00:00
cyy
6c4f43f826 Decouple most Caffe2 components from the build systems (r-barnes) (#125711)
Copying #125392 here so I can edit it more easily.

Co-authored-by: cyy <cyyever@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125711
Approved by: https://github.com/malfet
2024-05-09 02:19:59 +00:00
fdff9920f6 [pytorch] fix blasLt on windows (#125792)
Summary:
It seems like required functions are not available due to `_MSC_VER` guard. Does anyone have more context why this functionality has been disabled for windows?

I'm also unsure how this currently compiles in OSS land on windows, as there doesn't seem to be any preprocessor protection around `scaled_gemm` getting pulled in.

Test Plan:
Fix compilation errors like this
```
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): error C2039: 'scaled_gemm': is not a member of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\CUDABlas.h(19): note: see declaration of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): note: the template instantiation context (the oldest one first) is
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(71): note: while compiling class template 'at::cuda::tunable::DefaultScaledGemmOp'
Action failed: fbsource//xplat/caffe2:ATen_cuda_lib_ovrsource (cxx_compile aten/src/ATen/native/cuda/Blas.cpp)
```

Differential Revision: D57087985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125792
Approved by: https://github.com/malfet, https://github.com/eqy
2024-05-09 01:54:25 +00:00
902a74c1d6 [caffe2] Lazily symbolize backtrace in c10::Error (#125787)
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.

By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.

Test Plan:
Added unit tests exercising the lazy nature of `what()`.

Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346

We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
 {F1496627896}

Differential Revision: D57128632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125787
Approved by: https://github.com/huydhn
2024-05-09 01:46:57 +00:00
ea3f625e32 Revert "[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)"
This reverts commit 33e6791645b5950b0f39301f55b8a4a79c0ca847.

Reverted https://github.com/pytorch/pytorch/pull/124041 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
ca579c177b Revert "[Inductor][Quant] Change the QConv output scale name (#124246)"
This reverts commit 9ba9f7fa821af062ef3d1580b75e70f74ba05063.

Reverted https://github.com/pytorch/pytorch/pull/124246 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
97509c8eb2 Revert "[Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)"
This reverts commit 3da949b0fbe91e802d30e00165141d1390621d71.

Reverted https://github.com/pytorch/pytorch/pull/125207 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
19bab45e67 [Inductor] Add SDPA pattern for OOB GPT2 models (#125562)
Add SDPA pattern for 2 OOB models:
- token-classification+gpt2
- text-generation+gpt2

Note that these models have two masks: attention mask with float type and causal mask with bool type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125562
Approved by: https://github.com/jgong5, https://github.com/kadeng, https://github.com/jansel
2024-05-09 01:21:09 +00:00
3da949b0fb [Inductor][Quant] Fix PT2E Dynamic Quant regression (#125207)
**Summary**
Fix 2 regression issues caused by previous refactor:

- Fix the issue in dequant promotion pass with dynamic quant when the dequant node is with `tensor` overload.
- Fix numerical issue in dynamic quant, since input will convert to scales' dtype (which is `double`) to do quant operatoration with previous implementation.

**TestPlan**
```
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_input_dim_exceeds_2
clear && python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_dynamic_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125207
Approved by: https://github.com/peterbell10, https://github.com/jgong5
ghstack dependencies: #124041, #124246
2024-05-09 01:05:00 +00:00
d474d79420 [dynamo][disable] Move disable impl to its own __call__ method (#125486)
There were internal cases where calling disable in distributed causes trace_rules to be generated, which imports distributed and causes circular import errors.

The code has also gone bulky. I think it is time for disable code to exist separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125486
Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel
2024-05-09 01:03:12 +00:00
9ba9f7fa82 [Inductor][Quant] Change the QConv output scale name (#124246)
**Summary**
Change the name of QConv output scale from `inv_output_scale` to `output_scale` after we move the optimization of quant/dequant from decomposition to lowering phase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124246
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #124041
2024-05-09 00:57:10 +00:00
33e6791645 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445, we can move the optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase to avoid the changes. In this way, we can:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure no bc-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 00:54:22 +00:00
1b1b18a7a4 Add LRScheduler Composability E2E Tests (#125653)
adds tests to verify the LRSchedulers correctly update the compiled optimizers without recompiles.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125653
Approved by: https://github.com/yanboliang
ghstack dependencies: #123751, #123752, #123753, #125383
2024-05-09 00:52:43 +00:00
8c9c169b48 LRScheduler composability kernel tests (#125383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125383
Approved by: https://github.com/eellison
ghstack dependencies: #123751, #123752, #123753
2024-05-09 00:52:43 +00:00
69eeef0727 Update LRScheduler to handle tensor LR (#123753)
Enables LRScheduler to handle tensor LRs.

Note on test changes:
For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751, #123752
2024-05-09 00:52:43 +00:00
7b36b4a765 Fix user warning for tensor LR (#123752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123752
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751
2024-05-09 00:52:43 +00:00
0ea6ffc613 Swap warning counter to flag in LRScheduler (#123751)
This was a counter previously, this should be a flag to indicate whether or not the optimizer step has been called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123751
Approved by: https://github.com/janeyx99
2024-05-09 00:52:43 +00:00
78a1693266 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 1) (#122866)
Reuse Inductor test suite for Intel GPU including:
test_torchinductor.py
test_triton_wrapper.py
test_metrics.py
test_codecache.py
test_codegen_triton.py
test_kernel_benchmark.py
test_triton_heuristics.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122866
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-05-09 00:51:35 +00:00
4dd33a1c2b Better core binding in torch.backends.xeon.run_cpu when launced from torchrun with --nproc-per-node (#123711)
This PR fix `torch.backends.xeon.run_cpu` behavior when it is launched from `torchrun` with `--nproc-per-node` parameter.

As a CPU launcher, `run_cpu` would bind cores to each instance it launches using `numactl`, and assign cores to each instance evenly.

However, if we use `torchrun` to start `run_cpu` and use `--nproc-per-node` to create multiple `run_cpu` processes.   In this case, each `run_cpu` process would assume it can use all the CPU cores, which causes each `run_cpu` process compete for CPU cores.  This results in poor performance.

This PR recognize environment variable `LOCAL_WORLD_SIZE` and `LOCAL_RANK` set by `torchrun`, then use this information to further shard the cores bind to each instance.  With this PR, when launched by `torchrun --nproc-per-node ...`, different CPU cores will be bind to different workers, which maximize CPU utilization and application performance.

The specific use case this PR enabled is using TorchServe with DeepSpeed tensor parallel.  In this case, TorchServe would run `torchrun --nproc-per-node <tp_size>` to start tensor parallel workers it needed.  When run TorchServe on multisocket CPU server with DeepSpeed tensor parallel, we need this PR to achieve best performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123711
Approved by: https://github.com/jingxu10, https://github.com/ezyang
2024-05-09 00:32:11 +00:00
8def2e92f2 [inductor] autotune benchmark support for cpu (#125159)
This PR adds the autotune Infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU with warmups and returns the median number from multiple trials.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
2024-05-09 00:28:49 +00:00
96a5698408 Fix torch.profiler Schedule Function (Function Event only) (#125510)
Summary:
github issue: https://github.com/pytorch/pytorch/issues/73828

Whenever we transition from RECORD_AND_SAVE to WARMUP in the profiler schedule, we instantiate a new backend profiler which wipes out the last cycle's information. This makes using the `repeat` parameter less useful in the schedule as you only get contents of the last cycle/repeat. In this diff, we save the accumulated Function Events before setting the new ones and then merge the two EventLists after post processing/cleaning is done. This diff only fixes Function Events so that we can get statistics over each cycle within a schedule. A follow up should be made to accumulate the chrome tracings as well if it is requested.

Test Plan: Added functional python tests in test_profiler.py that test different schedules and their FunctionEvent counts

Differential Revision: D56956245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125510
Approved by: https://github.com/aaronenyeshi
2024-05-08 23:32:50 +00:00
ff090c6937 [dynamo] support tracing nn.Module @property that accesses closure cells (#125724)
Fix https://github.com/pytorch/pytorch/issues/125702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125724
Approved by: https://github.com/jansel, https://github.com/jbschlosser
ghstack dependencies: #125710
2024-05-08 23:25:39 +00:00
93f3d561f9 [dynamo] don't make nn parametrized Modules unspecialized (#125710)
Workaround for https://github.com/pytorch/pytorch/issues/125314 and https://github.com/pytorch/pytorch/issues/125478.

We no longer make parametrized nn.Modules unspecialized. Instead, when we are about to call a function from the `torch.nn.utils.parametrize` module, we skip the frame.

The script from https://github.com/pytorch/pytorch/issues/125314 now outputs
```
parametrize=True: 6587ms
parametrize=False: 1729ms
parametrize=True: 4497ms
parametrize=False: 1539ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125710
Approved by: https://github.com/jansel, https://github.com/jbschlosser
2024-05-08 23:25:39 +00:00
e71207b729 Fix infinite recursion in API BC test (#125706)
```
python test/test_fx.py -k test_public_api_surface
```
was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`).

Fixes #104012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706
Approved by: https://github.com/BoyuanFeng
2024-05-08 23:07:16 +00:00
04bf7713e8 [c10d] Reduce test time by reusing ProcessGroup (#125648)
## Problem this PR resolves
Today, most of distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?

If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are "**test-scope fixtures"**, meaning, they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time.

## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class.  Therefore, we would do:
```
@classmethod
def setUpClass(self):
    dist.init_process_group(...)

@classmethod
def tearDownClass(self):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**

Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.

Point is, launch the processes only once.

## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.

## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.

## Next step
Migrate the tests of distributed that would fit with this test style!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
2024-05-08 22:33:40 +00:00
8f27c7f181 [sparse] Fix type-dispatch errors (#124777)
I am building PyTorch with the Intel oneAPI 2024.0 compiler and without cuSparseLt, and encountered various type errors of the following forms:
```
[ 63%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu.o
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(87): error: no operator "=" matches these operands
            operand types are: cutlass::uint2b_t = int
          detected during:
            instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(88): error: no operator "=" matches these operands
            operand types are: cutlass::uint2b_t = int
          detected during:
            instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(238): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
            argument types are: (int, int)
            object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
          detected during:
            instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here

/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(241): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
            argument types are: (int, int)
            object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
          detected during:
            instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
            instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
            instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
            instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
            instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
```

The casts added by this PR get the build working again for me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124777
Approved by: https://github.com/jcaip
2024-05-08 21:49:33 +00:00
1b8891a31d make torch._check understand Eq commutativity (#125629)
Summary:
Given `torch._check(a == b)` we can still get a data-dependent error needing `b == a`. Simple fix.

```
def forward(self, x1, x2, x3, y):
    z1 = x1.item()
    z2 = x2.item()
    z3 = x3.item()
    torch._check((z2 + z3) == z1)
    # torch._check(z1 == (z2 + z3)) didn't work, now does
    if z2 + z3 == z1:
        return y * 2
    else:
        return y + 3
```

Differential Revision: D57014730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125629
Approved by: https://github.com/ezyang
2024-05-08 21:39:21 +00:00
346343e6b5 [DeviceMesh] Make _validate_tp_mesh_dim support 3D (#125763)
Currently a 3D mesh with a submesh sliced out for TP is going to fail
this check.

According to @wanchaol in [this
comment](https://github.com/pytorch/pytorch/pull/125250#discussion_r1586653669)
it should be OK to remove these checks.  Though I would appreciate a
more careful review here, since I'm not too sure if there are other edge
cases where these checks are important.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125763
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-05-08 21:22:11 +00:00
e457fdcd81 Revert "[caffe2] Lazily symbolize backtrace in c10::Error (#125682)"
This reverts commit 08f6ef0e1ccadf4626c0d7ecb15db96c01b8f418.

Reverted https://github.com/pytorch/pytorch/pull/125682 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125682#issuecomment-2101477132))
2024-05-08 21:11:27 +00:00
7e0edafe86 [compiled autograd][dynamo] improve lifted autograd.Function.backward handling and fallback to pseudo-eager (#125661)
- `FakeContext` hides all fields other than ctx.saved_tensors, this dynamo errors when the autograd.Function.backward uses other attrs on ctx and it also doesn't allow fallback to eager.
- If we remove it, we still can't fallback to eager: node variables are already freed (ctx.saved_tensors throws)
- However, we can fallback to "pseudo-eager" by using a duck-typed ctx and routing the ctx.saved_tensors to lifted tensors
- Dynamo tries to inline external_utils.call_backward, treats BackwardCFunction as a AutogradFunctionContextVariable (only used up until we create the fake context: FakeBackwardCFunction)
- we call_function backward from the forward class AutogradFunctionVariable, and we still pass in the fake context as a UserDefinedObjectVariable (can later use AutogradFunctionContextVariable + HOO graph speculate)

Fixes #125489  #124827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125661
Approved by: https://github.com/jansel
2024-05-08 21:00:37 +00:00
de8ce3be20 [TD] Heuristic based on file path (#125477)
Get the folders of each changed file and attempt to map the folders to some tests.

The intention is to push up things like dynamo tests if someone changes a file in the dynamo folder

Please see the tests for examples of what should be matched together
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125477
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-05-08 20:56:53 +00:00
17ab7f77c2 [Kineto] Update Kineto Submodule Hash (#125621)
Summary:
Update the Kineto submodule in PyTorch. The following diffs are included:
- Removed CUPTI overhead track in AMD traces
- Delay logging for CUDA stream wait event until the end
- Changed chrome trace unit will be in milliseconds, and data will be in ns
- Refactored roctracer to include metadata and improved names.
- Lowered Kineto Stage log level, reducing noisy output
- Changed relative time of ts to quarterly interval for distributed trace alignment
- Fixed Non-risky deprecated use of 0/NULL
- Removed hardcoding of /opt/rocm
- Handling cuLaunchKernelEx better
- Fixed Non-risky missing field initializers and unused variables.

Test Plan: CI and this is running internally.

Differential Revision: D57011897

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125621
Approved by: https://github.com/sraikund16
2024-05-08 20:49:07 +00:00
255a3afbf1 [dynamo] don't LOAD_FAST local context variables in modified bytecode (#125719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125719
Approved by: https://github.com/jansel
2024-05-08 20:39:06 +00:00
0feca5e341 Increase Python version for Docker builds (#125782)
Fixes https://github.com/pytorch/pytorch/issues/73714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125782
Approved by: https://github.com/huydhn
2024-05-08 20:32:32 +00:00
19a9de114a Forbid subclassing _TensorBase directly (#125558)
As per title.
This ensures that all the places where we assume the method defined in _tensor.py do exist.

BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore.
You should replace any use of _TensorBase to Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
2024-05-08 20:29:29 +00:00
afea237935 [minimizer] Create block traverse mode in minimizer for graph aware debugging (#125613)
Summary:
block traverse mode:
Assumption:
culprits block formed by (start_idx, end_idx) in topologically sorted graph
and the error will go away if  graph patterns breaks

Reviewed By: junhanh

Differential Revision: D56799587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125613
Approved by: https://github.com/jfix71
2024-05-08 20:21:21 +00:00
603d1e6049 [DTensor] allow numel 1 tensor operand to be implicitly replicate DTensor (#125073)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125073
Approved by: https://github.com/wanchaol
2024-05-08 19:47:47 +00:00
445a0c01da Retry: Low mem max_pool2d_with_indices (#122832)
Based on #105687

The low memory path does not need to strictly return the int8 offsets
instead the offset to index computation can be separated from the
inner function of the max pool lowering. The partitioner can then choose
to move the offset to index computation into the backward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122832
Approved by: https://github.com/peterbell10, https://github.com/eellison
2024-05-08 19:37:08 +00:00
005a12722d Remove duplicated nodes in dfs_iter_find_cycle (#125585)
In case the `dfs_iter_find_cycle` function receives duplicated node entries in the `all_user_nodes` argument, it will still process each one of them. This commit changes the `all_user_nodes` list into a set, so each element is unique, resulting in a shorter execution time of the `propose_partitions` function.

Fixes #125584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125585
Approved by: https://github.com/Skylion007
2024-05-08 19:21:15 +00:00
3f36145db2 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-08 19:15:55 +00:00
f4b2d50fd7 [export] disable_forced_specializations (#124949)
Summary:
By default, some inferred dynamic shapes guards/constraints that are not expressible with the current dynamic shapes language will lead to specialization to the concrete input values provided. If disable_forced_specializations is set to True, we will not specialize, and will not perform runtime checks on such produced guards. Instead, we allow the user to specify arbitrary shapes, and fail during runtime if the inputs are invalid. Constraints expressible with the language (e.g. ranges, linear derived dims) will still be enforced, and behavior for all other guards remains the same.

Cases where we typically specialize are reshapes:
```
x: [4, 6]  # [s0, s1]
x = x.reshape([x.shape[0] - 1, -1])
# this emits a guard Mod(s0*s1, s0-1) = 0, we specialize on s0=4, s1=6

x: [4, 6], y: [24]  # [s0, s1], [s2]
x = x.reshape([-1]) + y
# this emits a guard s0*s1 = s2, we specialize on s0=4, s1=6, s2=24
```

For now only applicable for non-strict mode (need to figure out how to pass this flag into dynamo's call of produce_guards).

Test Plan: Added test case that checks compilation, runtime, and suggested fixes behavior.

Differential Revision: D56361177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124949
Approved by: https://github.com/avikchaudhuri
2024-05-08 18:42:39 +00:00
74b1674860 Use nvidia/cuda:CUDA_VERSION-devel-ubuntu22.04 as base for official Docker release (#125770)
We don't need to install cudnn system wide. We already install it with pytorch install during this step:
```
[conda-installs 2/3] RUN case linux/amd64 in          "linux/arm64")  pip install --extra-index-url https://download.pytorch.org/whl/cpu/ torch torchvision torchaudio ;;          *)              /opt/conda/bin/conda install -c "pytorch-nightly" -c "nvidia" -y "python=3.10" pytorch torchvision torchaudio "pytorch-cuda=$(echo 12.1.1 | cut -d'.' -f 1-2)"  ;;     esac &&     /opt/conda/bin/conda clean -ya
```
Ref: https://github.com/pytorch/pytorch/actions/runs/8998055687/job/24717424912

Validate via: https://github.com/pytorch/builder/actions/workflows/validate_docker_images.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125770
Approved by: https://github.com/nWEIdia, https://github.com/seemethere
2024-05-08 18:41:30 +00:00
faf0015052 [dtensor] run transformer sdpa in dtensor (#122997)
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.

Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention?

Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997
Approved by: https://github.com/XilunWu
ghstack dependencies: #122995, #122996
2024-05-08 17:08:47 +00:00
efece3f142 [dtensor] add op support for memory efficient attention (#122996)
This is a followup to flash attention. On cuda, flash attention is supported only for fp16/bf16, whereas memory efficient attention is supported for fp32 (but not fp64). With this PR, one can run SDPA and in general Transformer completely in dtensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122996
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #122995
2024-05-08 17:08:27 +00:00
08be8ec8a9 [dtensor] improve new factory strategy (#122995)
Previously, the new tensor out of the "new factory" all become replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old spec is inherited and preferred.

To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
2024-05-08 17:05:07 +00:00
affd7a9789 Get PT2 Cutlass backend working under fbcode [take 2] (#125688)
Differential Revision: D57051232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125688
Approved by: https://github.com/chenyang78
2024-05-08 16:44:49 +00:00
87f86fd586 Fix multi template debug trace (#125703)
Fix for https://github.com/pytorch/pytorch/issues/125642

We were trying to render the template of multi template kernel before it had been finalized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125703
Approved by: https://github.com/shunting314
2024-05-08 16:31:18 +00:00
e28d9947a1 AsyncCollectiveTensor: prevent wait_tensor() calls on graph inputs from getting DCEd (#125677)
@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.

The problem is the following:

(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block

(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.

(3) the way this is supposed to happen is: AOTAutograd traces through the __torch_dispatch__ of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing

(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
2024-05-08 15:54:01 +00:00
5d97c22845 AOTAutograd: use info not debug logging for ViewAndMutationMeta (#125676)
Before, the AOTAutograd metadata would not get logged when running with `TORCH_LOGS="aot"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125676
Approved by: https://github.com/albanD
2024-05-08 15:54:01 +00:00
6f619cc727 [ez] functorch/test_vmap and test_dataloader to run in parallel (#125597)
Also mark test_svd serial in linalg to see if it helps with the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125597
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-08 15:37:29 +00:00
bd2635578b [vision hash update] update the pinned vision hash (#125521)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125521
Approved by: https://github.com/pytorchbot
2024-05-08 15:34:36 +00:00
2e237fcd70 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 469383755fe416eb1c41fa724762ad3eaecdff07.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to broke inductor/test_codecache and inductor/test_max_autotune 469383755f https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))
2024-05-08 15:33:20 +00:00
c5b6c696c1 Start refactoring runtime wrappers (#125595)
This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods:
A **pre-compile** function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them
A **post-compile** function, which takes in a compiled_fn and runtime args and wraps the compiled_function.

Extra metadata necessary to run the compile functions can be stored on the attributes of the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125595
Approved by: https://github.com/bdhirsh
2024-05-08 15:20:36 +00:00
13462ecd27 Update preserve_node_meta to reset torch.fx.traceback.current_meta (#125500)
Fixes https://github.com/pytorch/pytorch/issues/122766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125500
Approved by: https://github.com/xmfan, https://github.com/ezyang
2024-05-08 14:30:34 +00:00
8cad88e1f3 [BE]: Improve exception typing. Remove NOQAs (#125535)
Improve some exception typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125535
Approved by: https://github.com/albanD
2024-05-08 14:07:13 +00:00
82b7b59d2a [inductor] Check if n is the input tensor of conv_pointwise (#125119)
Fix https://github.com/pytorch/pytorch/issues/124837.
Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise, if so freeze the layout to channels_last.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-05-08 13:25:49 +00:00
d17be10df1 make torch.amp.autocast more generic (#125103)
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast` since `torch.amp.autocast` has NOT the default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) respectively. We would like `torch.amp.autocast` to be more generic to help the developer/customer write the device-agnostic code. Because there are not enough reasons to add device-specific autocast `torch.xxx.amp.autocast` for each device backend.

# Solution
When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` is necessary to be supported in JIT path for BC.

# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in eager and jit path respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
2024-05-08 12:13:26 +00:00
320af5eaa6 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-08 08:14:06 +00:00
15a9770225 [DSD] Implement broadcast_from_rank0 option for optim state_dict (#125339)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125339
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708, #125338
2024-05-08 07:22:20 +00:00
0542fd485f [DSD] Implement broadcast_from_rank0 option for model state_dict (#125338)
Summary:
This is useful if users would like to avoid CPU memory OOM when loading from a full state_dict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125338
Approved by: https://github.com/weifengpy
ghstack dependencies: #125708
2024-05-08 07:11:18 +00:00
88fbe79550 [DSD] Fix set_optimizer_state_dict() changes the parameters with some optimizers (#125708)
Summary:
Some optimizers, like AdamW, change the parameters even if gradients are zero. So `set_optimizer_state_dict()` may affect the parameters values with these optimizers. This PR fixes the issue.

This PR also fixes https://github.com/pytorch/pytorch/issues/121186.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125708
Approved by: https://github.com/wz337
2024-05-08 06:57:20 +00:00
469383755f [inductor] add cpp builder code. (#124045)
Previous full PR https://github.com/pytorch/pytorch/pull/115248 is failed to merge due to fb_code is hard to debug.
I also tried to submit them as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515. And they have passed PreCI at that time.

Now I tried to split https://github.com/pytorch/pytorch/pull/115248 into smaller piece, and it is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code, the new cpp_builder support Windows OS.
2. Add CPU ISA checker which is cross OS and exported from backend cpuinfo.
3. Switch compiler ISA checker to new cpp builder.
4. CppCodeCache use the new ISA checker.
5. Add temprary `test_new_cpp_build_logical` UT to help on transfer to new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-08 05:27:15 +00:00
08f6ef0e1c [caffe2] Lazily symbolize backtrace in c10::Error (#125682)
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.

By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.

Test Plan:
Added unit tests exercising the lazy nature of `what()`.

Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346

We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
 {F1496627896}

Reviewed By: ezyang

Differential Revision: D56586844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125682
Approved by: https://github.com/ezyang
2024-05-08 04:57:59 +00:00
a1a22a22d5 [ROCm] Parameterize the triton build dir (#125420)
- Removes hard coding and helps in internal builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125420
Approved by: https://github.com/malfet
2024-05-08 04:46:52 +00:00
50073127b5 [tp] add some test for shard output layouts for rowwise parallel (#125713)
as titled. This is to make sure everything work as expected if we
configure RowwiseParallel output layouts be sharded

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125713
Approved by: https://github.com/XilunWu
ghstack dependencies: #125693, #125695
2024-05-08 03:45:34 +00:00
9a2375b6b7 [dtensor] improve some pretty print in op schema (#125695)
as titled, when I debugged
https://github.com/pytorch/pytorch/pull/125369 I found this would be
quality of life improvements

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125695
Approved by: https://github.com/yifuwang, https://github.com/XilunWu
ghstack dependencies: #125693
2024-05-08 03:45:34 +00:00
65fec7bbbf [dtensor] make sure meta tensor random op does not alternate rng state (#125693)
as titled, for meta tensor ops, we should avoid calling the RNGTracker,
which could potentially alter the current RNG state. Meta tensor ops
should be no-op and post `to_empty` init would really alter the RNG
state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693
Approved by: https://github.com/XilunWu
2024-05-08 03:45:29 +00:00
38baa02a40 Meta kernel for _pack_padded_sequence (#124794)
Summary: Op implementation: 8cf54929e3/aten/src/ATen/native/PackedSequence.cpp (L34)

Fixes https://fb.workplace.com/groups/pytorch.edge.users/permalink/1499571650913123/

I'm not entirely sure how to test this meta kernel.

Differential Revision: D56478332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124794
Approved by: https://github.com/ezyang
2024-05-08 03:11:22 +00:00
2deea9e6e9 [profiler] enable CUPTI range profiler in build (#125685)
Fixes #125272
## About
(This is a re-spin of PR #106617)

Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API) added in PR[75616](https://github.com/pytorch/pytorch/pull/75616).  Support for the range profiler mode was disabled as we had to link with a NV PerfWorks library (`libnvperf_host.so`). This PR adds that link.

The change includes-
* Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed
See https://github.com/pytorch/kineto/pull/724

## Example usage of CUPTI profiler
The code snippet below shows how to configure pytorch profiler in CUPTI Profiler mode. Any code included in profiling window with be profiler by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics
```
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        on_trace_ready=trace_handler,
        experimental_config=torch.profiler._ExperimentalConfig(
            profiler_metrics=[
                "kineto__tensor_core_insts",
                "dram__bytes_read.sum",
                "dram__bytes_write.sum"],
            profiler_measure_per_kernel=False),
    ) as prof:
        res = train_batch(modeldef)
        prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.

### Details of how to configure CUPTI profielr
The` _Experimental` config structure can be used to pass metrics to profiler
```
   profiler_metrics : a list of CUPTI profiler metrics used
       to measure GPU performance events. Any metric supported by CUPTI can be used, see here=
       https://docs.nvidia.com/cupti/r_main.html#r_profiler
       There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting.
   profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
      or for the entire measurement duration.
```

## Testing
Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724)
```
$> USE_CUDA=1 python setup.py install
--   CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
--   CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```

Then run example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This only works on V100+ GPUs only. Adding logs for debugging etc.
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
  Log file: /tmp/libkineto_activities_1683060.json
  Trace start time: 2023-02-11 19:11:47  Trace duration: 500ms
  Warmup duration: 0s
  Max GPU buffer size: 128MB
  Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109]    dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
2024-05-08 02:34:31 +00:00
9fedf41b60 Dockerfile should set the syntax directive to v1 (#125632)
Fixes #125526 [#1811](https://github.com/pytorch/builder/issues/1811)

Adopt syntax=docker/dockerfile:1 whcih has been stable since 2018, while still best practice to declare in 2024.
- Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile).
- While you can set a fixed minor version, [Docker officially advises to only pin the major version]

```
(https://docs.docker.com/build/dockerfile/frontend/#stable-channel):
We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle.
BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version.
```

**Support for building with Docker prior to v23 (released on Feb 2023)**
NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1, according to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore) 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file.

Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally, the user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive which only applies to building this Dockerfile, not images that extend it.

However if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started) where they should only need the ENV DOCKER_BUILDKIT=1, presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 with releases of Docker since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, as per Dockers official guidance (usually via including an additional repo to the package manager).

**Reference links**
Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information.

Feel free to remove any of the reference links, they're mostly only relevant to maintainers to be aware of (which this PR itself has detailed adequately above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125632
Approved by: https://github.com/malfet
2024-05-08 01:52:56 +00:00
58e045d03c [MPS] Fix strided ELU op (#125692)
Fixes https://github.com/pytorch/pytorch/issues/124834

Summary of changes:

In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original ouput.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692
Approved by: https://github.com/kulinseth
2024-05-08 01:34:40 +00:00
21aaac47e7 [torchelastic] add timing events to different stages of rendezvous (#125636)
Summary: as title

Test Plan: unit tests. Launched a test job and observed scuba results: {F1506543300}

Differential Revision: D57018103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125636
Approved by: https://github.com/d4l3k
2024-05-08 01:14:23 +00:00
a3d97f6ce4 [ONNX] Benchmark onnx export w/ ort fusions (#125700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700
Approved by: https://github.com/thiagocrepaldi
2024-05-08 01:10:05 +00:00
baf36f6d11 Pad bandwidth bound split k kernels on a100 (#125650)
For
```
import torch
import triton

dtype = torch.bfloat16

t1 = torch.empty([2, 20569856], dtype=dtype, device="cuda")
t2 = torch.empty([20569856, 13], dtype=dtype, device="cuda")

@torch.compile()
def benchmark(t1, t2):
    return torch.ops.aten.mm(t1, t2)

print(triton.testing.do_bench(lambda: benchmark(t1, t2)))
```

Improves perf from 449ms  -> 1.2578779458999634ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125650
Approved by: https://github.com/bertmaher
2024-05-08 01:04:35 +00:00
ba27548679 [MPS] Remove in place views (causes too many crashes) (#124895)
Fixes https://github.com/pytorch/pytorch/issues/96153

Remove in place views as they are a general cause for many crashes.
Proper fix to handle views without copies will come in a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895
Approved by: https://github.com/kulinseth
2024-05-08 01:00:37 +00:00
3fb53bb6a7 [MPS] Fix strided mse_loss (#125696)
Fixes https://github.com/pytorch/pytorch/issues/124621

Summary of changes:
- In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original ouput.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696
Approved by: https://github.com/kulinseth
2024-05-08 00:52:26 +00:00
939b701d3a SymInt-ify mem-efficient attention forward op signature (#125418)
Need this for dynamic shapes! Before this PR, guards on constant min / max seq len values are introduced when SDPA calls mem-efficient attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125418
Approved by: https://github.com/soulitzer
2024-05-07 23:59:28 +00:00
bb6ba31250 [DCP] Adds storage metadata, and passes it during the save path (#124772)
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:

1. The creation of save_id and load_id:
    - a save_id and load_id is added to the filesystem writer. `save_id` is re-generated on every save call, and `load_id` is also re-generated on every load call.
    - both these ID's are stored in a new `StorageMeta` class, and saved as part of Metadata. (`load_id` is None when we save, and only set during load)

2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.

*If users now wish to access the checkpoint_id in the SavePlanner, they simple need to access the value in `storage_meta` from the `set_up_planner` call*

*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*

This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.

One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
2024-05-07 23:53:53 +00:00
244d93039d Remove fbobjc_configs from xplat (#125586)
Summary: Pull out the configs attributes in xplat targets, these no longer do anything.

Test Plan:
```
$ buck2 uquery //xplat/... > /dev/null
```

Reviewed By: d16r

Differential Revision: D56855974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125586
Approved by: https://github.com/malfet
2024-05-07 23:48:20 +00:00
8b4d62009d [EZ] Update jinja2 to 3.1.4 (#125698)
To address https://cwe.mitre.org/data/definitions/79.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125698
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-07 23:40:54 +00:00
8be4c1bc2f [export] Add metadata for nodes insert_deferred_runtime_asserts (#125414)
Fixes [internal error](https://fb.workplace.com/groups/1075192433118967/permalink/1416709435633930/).

The issue is that the asserting nodes added in the `insert_deferred_runtime_assertion` pass do not contain metadata that the ExportedProgram requires the graph to have. One solution to fix this is to retrace the entire module, or another solution is to manually add back this metadata.

This diff implements the latter solution (manually add back the metadata) through hooking into fx.graph's `create_node` function, and adding export-specific metadata for every node that is created. The reason I did this is so that the `insert_deferred_runtime_assertion` does not have to know about what metadata export wants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125414
Approved by: https://github.com/zhxchen17, https://github.com/BoyuanFeng
2024-05-07 23:15:21 +00:00
8024e72326 [export] Warn on capture_pre_autograd_graph. (#125602)
Summary: capture_pre_autograd_graph is deprecated and torch.export won't able to provide timely fix for this API. To reduce some confusion around this we should explicitly give users clear warnings.

Test Plan: eyes

Reviewed By: tarun292

Differential Revision: D56955202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125602
Approved by: https://github.com/angelayi
2024-05-07 22:51:17 +00:00
021ff7fd77 [BE] Explicitly handle all types c10::isSigned (#125637)
By defining `CASE_ISSIGNED` macros that just returns `std::numeric_limits<dtype>::is_signed`  for the types where it makes sense and explicitly code some types when it does not

Remove `default:` case from the switch to avoid regressions like the one reported in https://github.com/pytorch/pytorch/issues/125124 , as [`-Wswitch-enum`](https://clang.llvm.org/docs/DiagnosticsReference.html#wswitch-enum) in combination with `-Werror` will raise an error in case of a missing entry, for example:
```
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: warning: enumeration value 'QInt32' not handled in switch [-Wswitch]
  switch (t) {
          ^
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: note: add missing switch cases
  switch (t) {
          ^
1 warning generated.
```

Fixes https://github.com/pytorch/pytorch/issues/125124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125637
Approved by: https://github.com/albanD
2024-05-07 22:51:03 +00:00
51f25c08f4 Fix 'Could not infer dtype of SymBool' on torch.tensor call (#125656)
Internal xref:
https://fb.workplace.com/groups/469587837192818/posts/1638909336927323/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125656
Approved by: https://github.com/albanD
2024-05-07 22:41:49 +00:00
e3d5afc60a Enable dynamo'd test for 116499 (#123469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123469
Approved by: https://github.com/janeyx99
ghstack dependencies: #123619
2024-05-07 22:17:01 +00:00
0f02e0aa39 Disable dynamo on functional optims if capturable=False (#123619)
This resolves a bug in eager where if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on cuda but we would take the non-capturable path. We now fallback to eager if capturable=False.

Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze

Note on the actual fallback logic - there was an issue with torchscript originally not handling *args, **kwargs properly, after rectifying that by using `functools.wraps`, there was an additional bug with scoping which required the single tensor implementation to be in the global scope at the time of the fallback closure being created. I pass in the single tensor function to the `_disable_dynamo_if_unsupported` decorator to workaround this bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619
Approved by: https://github.com/janeyx99
2024-05-07 22:17:01 +00:00
0fd1fc17c3 [MPS] Fix abs for complex types (#125662)
By calling `realPartOfTensor:` if input type is complex on Sonoma and fall back to `at::view_as_real` trick on Ventura.

Split `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks

Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types

Fixes https://github.com/pytorch/pytorch/issues/125135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662
Approved by: https://github.com/kulinseth
2024-05-07 22:15:20 +00:00
2163956208 [TGIF][HHC][Sharding] add device_ordinal to Subgraph (#125616)
Summary: Add a new field device_ordinal to Subgraph class so we can store device ordinal during splitting

Test Plan: See test plan for D56535827

Differential Revision: D57010103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125616
Approved by: https://github.com/jiayisuse
2024-05-07 21:48:59 +00:00
b356a0de86 Add support for multiple flexattention calls in a single compile (#125516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125516
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-05-07 21:37:37 +00:00
d4225c55d9 [fx] Prioritize runtime assertions ops (#124213)
Summary:
We want to prioritize operators involved in data-dependent runtime assertions when legalizing the graph. For example, in the following piece of code, the `assert_scalar` and `assert_async` calls need to occur before the `slice_copy` for the program to run correctly with fake tensors. Otherwise we will run into a data-dependent error.

```
        _local_scalar_dense: "Sym(u113)" = torch.ops.aten._local_scalar_dense.default(aten_minimum_default);  aten_minimum_default = None

        ge_1: "Sym(u113 >= 2)" = _local_scalar_dense >= 2
        aten_scalar_tensor_default_3: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(ge_1);  ge_1 = None
        aten__assert_async_msg_2 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_3, '_local_scalar_dense is outside of inline constraint [2, 1000].');  aten_scalar_tensor_default_3 = None
        le_1: "Sym(u113 <= 1000)" = _local_scalar_dense <= 1000
        aten_scalar_tensor_default_4: "f32[]" = executorch_exir_dialects_edge__ops_aten_scalar_tensor_default(le_1);  le_1 = None
        aten__assert_async_msg_3 = executorch_exir_dialects_edge__ops_aten__assert_async_msg(aten_scalar_tensor_default_4, '_local_scalar_dense is outside of inline constraint [2, 1000].');  aten_scalar_tensor_default_4 = None

        mul: "Sym(-u112)" = -1 * sym_size;  sym_size = None
        add: "Sym(-u112 + u113)" = _local_scalar_dense + mul;  mul = None
        lt: "Sym(-u112 + u113 < 0)" = add < 0;  add = None
        aten__assert_scalar_default = executorch_exir_dialects_edge__ops_aten__assert_scalar_default(lt, 'Deferred runtime assertion failed -u0 + u1 < 0');  lt = None

        aten_slice_copy_tensor_3: "f32[u113]" = executorch_exir_dialects_edge__ops_aten_slice_copy_Tensor(getitem, 0, 0, _local_scalar_dense);  getitem = None
```

Test Plan: test case

Differential Revision: D56201450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124213
Approved by: https://github.com/SherlockNoMad
2024-05-07 21:31:10 +00:00
2f79a18324 Revert "[inductor] add cpp builder code. (#124045)"
This reverts commit 7864d287a1e56685aa754285cc2d3c31ff055f62.

Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing trunk jobs 7864d287a1 including lint ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))
2024-05-07 21:04:49 +00:00
c5e04a4479 More accurate is_bw and prompt parents cleanup for ModuleTracker utils (#125634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125634
Approved by: https://github.com/soulitzer, https://github.com/Chillee
2024-05-07 20:57:36 +00:00
fdfef759a6 Add userbase library dir to windows dll search path (#125684)
Fixes https://github.com/pytorch/pytorch/issues/125109 which is a regression introduced by https://github.com/pytorch/builder/pull/1467 that adds dynamic dependency to mkl, which if installed in the user-dir is placed into `sysconfig.sysconfig.get_config_var("userbase") / "Library" / "bin"`

Fix this one, but adding `userbase` folder to the DLL search path

Testing before this fix:
```
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\__init__.py", line 141, in <module>
    raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\lib\shm.dll" or one of its dependencies.
>>> exit()
```

After:
```
c:\Program Files\Python312>python
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> exit()
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125684
Approved by: https://github.com/malfet
2024-05-07 20:43:40 +00:00
7864d287a1 [inductor] add cpp builder code. (#124045)
Previous full PR https://github.com/pytorch/pytorch/pull/115248 is failed to merge due to fb_code is hard to debug.
I also tried to submit them as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515. And they have passed PreCI at that time.

Now I tried to split https://github.com/pytorch/pytorch/pull/115248 into smaller piece, and it is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code, the new cpp_builder support Windows OS.
2. Add CPU ISA checker which is cross OS and exported from backend cpuinfo.
3. Switch compiler ISA checker to new cpp builder.
4. CppCodeCache use the new ISA checker.
5. Add temprary `test_new_cpp_build_logical` UT to help on transfer to new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-07 20:07:41 +00:00
b23b6e7108 Ensure that vmap is restored properly if an exception is thrown during frame eval (#122074)
We save and restore the DynamicLayerStack during frame eval but since fx graph has no way to express a try/finally we just assume it will happen. If we throw an exception between the push and pop to the stack then we're left in a state that affects following operations poorly.  Make sure that if it's in a bad state we restore it after frame eval.

Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
2024-05-07 19:36:52 +00:00
196a0b1722 Add Inductor micro benchmark workflow (#125450)
Fixes #ISSUE_NUMBER

Co-authored-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125450
Approved by: https://github.com/huydhn
2024-05-07 18:56:01 +00:00
5fd0b6e5f7 Revert "add uuid in cudaDeviceProperties (#125083)"
This reverts commit f35fe4eaf1e9fa2e631f6bf1a3eb6e5fbf14183b.

Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky.  ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))
2024-05-07 18:16:27 +00:00
f7d48302b6 [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)
Summary:
Fixes #122792

state_dict includes only persistent buffers, while named_buffers() would
include non_persistent buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
2024-05-07 17:57:34 +00:00
a89177936c [DSD] Correctly handle _extra_state (#125336)
Summary:
distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
2024-05-07 17:31:33 +00:00
9f1d3eebf5 Update PyTorch ONNX Exporter maintainers (#125630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125630
Approved by: https://github.com/BowenBao, https://github.com/kit1980
2024-05-07 17:29:05 +00:00
6f1e3a6bf7 [DCP] Always flatten mapping even if no tensors present (#125335)
Summary:
 Right now DCP only flatten a mapping (e.g., dict) if that mapping has tensor objects. This behavior is odd as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is dataloader state_dict.

 We may also want to do so for a list/tuple. But this will cause extra pickles. So we don't do this for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
2024-05-07 17:08:49 +00:00
790f43c315 Run test_inductor_distributed with run_test (#125647)
CI feature like retrying and disable flaky test won't be available otherwise

### Testing

https://github.com/pytorch/pytorch/actions/runs/8977431927/job/24659532123#step:15:1688 looks correct now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125647
Approved by: https://github.com/clee2000
2024-05-07 17:07:36 +00:00
22767e4791 [DCP] Always create requests for non-tensor objects (#125334)
Summary:
If an object only exists on certain non-coordinator ranks, we still need to save them. Otherwise, we lose these objects. If they are duplicated, DCP will deduplicate them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
2024-05-07 17:04:36 +00:00
9782439277 [Profiler] Do not emit a warning when using CPU profiler (#125654)
This fixes a logic regression introduced by https://github.com/pytorch/pytorch/pull/123247 where
```python
if self.use_device and self.use_device != _get_privateuse1_backend_name():
```
was replaced with
```python
        VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]
        if self.use_device not in VALID_DEVICE_OPTIONS:
```

That triggers a warning every time code is invoke with `self.use_device` set to None

This change also skips all the checks which are useless if `use_device` is None to begin with
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125654
Approved by: https://github.com/aaronenyeshi
2024-05-07 16:56:17 +00:00
7863e04615 Back out "Get cutlass_library import working under fbcode" (#125606)
Summary: Original commit changeset: de79f6bfe348

Differential Revision: D57002294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125606
Approved by: https://github.com/chenyang78
2024-05-07 16:55:11 +00:00
71dc15742c [DSD] Improve the performance of distributed state_dict (#125501)
Summary:
1. Remove gc.collect(), which is not necessary.
2. Use lru_cache to cache _get_fqns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125501
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333
2024-05-07 16:55:05 +00:00
0e57bbb6d7 Set timeout for C++ tests (#125517)
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm.  In this case, a C++ test on Windows timed out after 2+ hours.

```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z     self.run()
2024-05-03T02:45:33.7929722Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z     self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z     task = get()
2024-05-03T02:45:33.7933946Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z     res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z     buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z     waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```

Retrying was working, but it was already too late to finish the job.  I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
2024-05-07 16:41:38 +00:00
1b396d69cb Revert "[CUDNN] Remove defunct cuDNN V8 API build flag (#120006)"
This reverts commit ee4cafa098ede2d9546016223cbc1a522ea3630a.

Reverted https://github.com/pytorch/pytorch/pull/120006 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm jobs in trunk ee4cafa098 ([comment](https://github.com/pytorch/pytorch/pull/120006#issuecomment-2098849813))
2024-05-07 16:28:04 +00:00
848fce35b5 [CI][ez] Don't retry when it says don't retry (#125643)
default arg for retry_shell is retries=1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125643
Approved by: https://github.com/huydhn
2024-05-07 16:20:00 +00:00
0de9ce9bb3 [export] Fix serialization of empty torch artifact (#125542)
A previous PR added support for serializing/deserializing example inputs, but this fails when `example_inputs` is none.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125542
Approved by: https://github.com/pianpwk, https://github.com/BoyuanFeng, https://github.com/ydwu4
2024-05-07 15:54:45 +00:00
b37bef9b13 Use triton_key instead of triton.__version__ for hash (#125624)
Using `triton.__version__` is not correct as the version is not always updated with each code change, so we should use the proper hash function provided by triton library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125624
Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/jansel
2024-05-07 15:43:50 +00:00
8573d9551a Fix to preserve tensor wrapper subclass dtype through multiprocessing serialization (#125615)
Fixes #125583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125615
Approved by: https://github.com/albanD
2024-05-07 14:35:48 +00:00
b29d77b54f Separate arm64 and amd64 docker builds (#125617)
Fixes https://github.com/pytorch/pytorch/issues/125094

Please note: Docker CUDa 12.4 failure is existing issue, related to docker image not being available on gitlab:
```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```
 https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: https://github.com/pytorch/builder/issues/1811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125617
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-05-07 11:50:54 +00:00
5dee46266a Fix & optimize open device registration test. (#125572)
1. Fix the wrong tests about lazy init for PrivateUse1 named foo
2. Refactor the tests and make it more flexible
3. Disable the two tests temporarily
     - test_open_device_faketensor
     - test_open_device_scalar_type_fallback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125572
Approved by: https://github.com/albanD
2024-05-07 08:30:01 +00:00
f0c6d6100b Enable dynamo-traced optimizer peak memory tests (#124543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124543
Approved by: https://github.com/yf225, https://github.com/janeyx99
2024-05-07 08:21:50 +00:00
5033d3ba6d Disable fb_memcache for MTIA (#125658)
Differential Revision: [D57035819](https://our.internmc.facebook.com/intern/diff/D57035819/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125658
Approved by: https://github.com/jamesjwu
2024-05-07 07:00:26 +00:00
e72936c27c [PT2D] Fix the circular import issue (#125618)
As title

Differential Revision: [D57011394](https://our.internmc.facebook.com/intern/diff/D57011394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125618
Approved by: https://github.com/wz337
2024-05-07 05:10:18 +00:00
acafabaa29 Rename TorchDynamo -> Dyanamo in the dynamo tutorial doc (#123431)
Less verbose and it aligns it with the dynamo deepdive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123431
Approved by: https://github.com/peterbell10
2024-05-07 05:07:00 +00:00
058e28108f [inductor][cpp] support int64 vertical vec reduction (fix #124821) (#125563)
Fix #124821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125563
Approved by: https://github.com/desertfire
2024-05-07 03:56:22 +00:00
a60fa960e5 refactor: extract get_lr warning (#125545)
Extract the `_get_lr_called_within_step` checking in the `get_lr()` of every LRSchedulers.
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125545
Approved by: https://github.com/janeyx99
2024-05-07 03:15:58 +00:00
461ffaaaf3 [dynamo] support torchbind object input (#124978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124978
Approved by: https://github.com/jansel
2024-05-07 03:02:00 +00:00
c165a8e71d Enable UFMT on test_decomp.py, test_expanded_weights.py and some files (#125117)
Part of: #123062

Ran lintrunner on:

- test/test_decomp.py
- test/test_deploy.py
- test/test_determination.py
- test/test_dlpack.py
- test/test_dynamic_shapes.py
- test/test_expanded_weights.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125117
Approved by: https://github.com/jansel
2024-05-07 02:36:40 +00:00
48b6c8dbc3 [Inductor] log fusion failure due to index mismatch (#124986)
The scheduler searches for fusion opportunities by looking for common memory access. Two memory access are considered common not only when the buffer name match, but it also requires more things
- index formula matches
- var_ranges matches

In this PR, I want to log all the fusion failures due to mismatch index formula or var_ranges. I also want to further categories the failures. Right now I found the following failure categories
- rand_seed: the index for rand seed access is an integer and different access uses different integer offset
- different numel: this happens for cat operation
- broadcast: e.g. kernel A write a buffer which is broadcasted and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when use a concatenated linear layer to project Q/K/V and then split the result. Each split will point to the same buffer with different offset.
- unknown

My hope is to make sure for the models I tested, there is no fusion failure falling in the unknown category so all the failures are well understood and categories. Right now it's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2  ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )

For BertForMaskedLM, we found 82 instances of fusion failures and majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatch comes from in real models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-07 02:29:00 +00:00
f35fe4eaf1 add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy
2024-05-07 01:26:01 +00:00
4332fc4095 [export] Allow constant attr mutation (#125424)
Test Plan: CI

Differential Revision: D56893728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125424
Approved by: https://github.com/pianpwk
2024-05-07 00:34:57 +00:00
c0c2f6156a Updated docs to add the error case for torch.multinomial Issue#125388 (#125495)
Summary: Updated docs to add the error condition for torch.multinomial

Test Plan: No change in code

Reviewers: @drisspg

Subscribers: @drisspg

Tasks:

Tags:

Fixes #125388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125495
Approved by: https://github.com/drisspg
2024-05-07 00:26:27 +00:00
3407899ba1 DTensor Fused ADAM (#125369)
Fixes https://github.com/pytorch/pytorch/issues/124633 https://github.com/pytorch/ao/issues/205

```
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adamw_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 9 deselected / 1 selected
Running 1 items in this shard

test/distributed/_tensor/test_optimizers.py .

=============================================================================== 1 passed, 9 deselected in 5.95s ================================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$ pytest test/distributed/_tensor/test_optimizers.py -s -k adam_1d_sharding
===================================================================================== test session starts ======================================================================================
platform linux -- Python 3.9.19, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/marksaroufim/pytorch
configfile: pytest.ini
plugins: hypothesis-6.100.2
collected 10 items / 7 deselected / 3 selected
Running 3 items in this shard

test/distributed/_tensor/test_optimizers.py ...

=============================================================================== 3 passed, 7 deselected in 10.79s ===============================================================================
(pt) [marksaroufim@devvm17057.vll0 ~/pytorch (dfusedadam)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125369
Approved by: https://github.com/wanchaol
2024-05-07 00:08:09 +00:00
65fc3c31bc [BE] Delete unused AT_FORALL_SCALAR_TYPES_AND[456] (#125607)
Check that they are not used by running the following
```
% grep -h "AT_FORALL_SCALAR_TYPES_AND" . -R|grep -v #define|cut -d\( -f1|sort|uniq
              AT_FORALL_SCALAR_TYPES_AND3
        AT_FORALL_SCALAR_TYPES_AND3
      AT_FORALL_SCALAR_TYPES_AND
      AT_FORALL_SCALAR_TYPES_AND2
      AT_FORALL_SCALAR_TYPES_AND3
      AT_FORALL_SCALAR_TYPES_AND7
    AT_FORALL_SCALAR_TYPES_AND2
    AT_FORALL_SCALAR_TYPES_AND3
    AT_FORALL_SCALAR_TYPES_AND7
  AT_FORALL_SCALAR_TYPES_AND2
  AT_FORALL_SCALAR_TYPES_AND3
  AT_FORALL_SCALAR_TYPES_AND7
// AT_FORALL_SCALAR_TYPES / AT_FORALL_SCALAR_TYPES_AND macros below, which are
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
using at::Half; // for AT_FORALL_SCALAR_TYPES_AND3
```
or by checking online using https://github.com/search?type=code&q=AT_FORALL_SCALAR_TYPES_AND4+repo%3Apytorch%2Fpytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125607
Approved by: https://github.com/albanD
2024-05-07 00:01:35 +00:00
3411d54811 fix loading optimizer options from archive (#125215)
This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
2024-05-06 23:58:40 +00:00
eqy
ee4cafa098 [CUDNN] Remove defunct cuDNN V8 API build flag (#120006)
The flag basically does nothing following #95722

Let's see if the quantization tests break

CC @malfet @atalmanagement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006
Approved by: https://github.com/malfet
2024-05-06 23:13:58 +00:00
b98c689261 Better repro command: include test class + fix paths for py3.8 (#125498)
Fixes #117850

This PR:
* Adds the class name in the repro command
* Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version)

Before (in python 3.8):
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo
```

After:
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498
Approved by: https://github.com/huydhn, https://github.com/janeyx99
2024-05-06 22:19:12 +00:00
22bcfc25ef Initial implementation of Inductor FX Graph Remote Cache (#124669)
This diff implements a remote caching strategy (memcache for internal and redis for external) for caching of Inductor FX Graph to Inductor generated wrapper file.

It uses the same idea with the autotuning result cache that is currently live.

This will land turned off and before turning this on by default, I will do more testing and including looking at the dynamic shape guards added by inductor.

Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
2024-05-06 22:10:27 +00:00
05bd7fe3eb Nested Tensor + AOTI test (#125513)
Since we expect AOTI to be important for serving NJT in the future, I'm adding a test demonstrating that AOTI currently works with NJT when NJT is entirely in the graph (no NJTs going in or out), and to prevent regressions.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125513
Approved by: https://github.com/angelayi, https://github.com/desertfire
2024-05-06 22:07:22 +00:00
1b3fd83ab2 [TD] Enable TD on AVX related configs (#125482)
On test configs `nogpu_AVX512` and `nogpu_NO_AVX2`, which are the next longest jobs on trunk after windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125482
Approved by: https://github.com/huydhn
2024-05-06 22:02:16 +00:00
8c74162074 Reduce the number of layers for mixtral moe model to adapt CI memory limitation (#125608)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125608
Approved by: https://github.com/Chillee, https://github.com/huydhn
2024-05-06 21:52:25 +00:00
7ddf57e9f5 xfail codegen dynamic if the test is xfailed (#125573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125573
Approved by: https://github.com/peterbell10
2024-05-06 20:55:33 +00:00
373a00df9a [dynamo] better file open method in funcname_cache (#125435)
Fix https://github.com/pytorch/pytorch/issues/124960?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125435
Approved by: https://github.com/ezyang
2024-05-06 20:55:15 +00:00
cbb3791891 [pipelining] Add tests for tracing frontend (#125449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125449
Approved by: https://github.com/wconstab
ghstack dependencies: #125273, #125448
2024-05-06 20:44:56 +00:00
bdaa7bbd7d [dynamo] fix potentially missing _torchdynamo_inline from ScriptFunction (#125447)
Fix https://github.com/pytorch/pytorch/issues/119747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125447
Approved by: https://github.com/jansel
2024-05-06 20:36:56 +00:00
ad9a27f3e5 Move autocast op list to autocast_mode.h to make sure other backends can reuse it. (#125114)
This PR refactors the op list added in #124051. To make sure other backends can reuse it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125114
Approved by: https://github.com/albanD
2024-05-06 20:31:15 +00:00
2a42c40791 Revert "Compute bounds for the variables created during codegen (#123100)"
This reverts commit bb668c6468dd4adf7737a069e7af4c3f612cfc81.

Reverted https://github.com/pytorch/pytorch/pull/123100 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing inductor tests bb668c6468 ([comment](https://github.com/pytorch/pytorch/pull/123100#issuecomment-2096837821))
2024-05-06 20:23:39 +00:00
9cd4bcb2c4 [FSDP] mark pre_backward_hook unserializable (#125464)
Saw a warning like this:

```
/opt/conda/lib/python3.10/site-packages/torch/utils/hooks.py:86: UserWarning: backward hook functools.partial(<function _pre_backward_hook at 0x7f9a3940fac0>, FullyShardedDataParallel(

....

), <torch.distributed.fsdp.flat_param.FlatParamHandle object at 0x7f25202a9720>) on tensor will not be serialized.  If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125464
Approved by: https://github.com/ezyang
2024-05-06 20:20:31 +00:00
7d10b06e1a Allow building for sm90a (#125523)
# Summary
Fixes: #125413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125523
Approved by: https://github.com/Skylion007
2024-05-06 20:03:12 +00:00
ee0c47349c Revert "Upgrade submodule oneDNN to v3.4 (#122472)"
This reverts commit dbcf123105a3f11d02f04067ca0cb377ed09e88c.

Reverted https://github.com/pytorch/pytorch/pull/122472 on behalf of https://github.com/atalman due to broke aarch64 builds and tests ([comment](https://github.com/pytorch/pytorch/pull/122472#issuecomment-2096750000))
2024-05-06 19:28:20 +00:00
af144139df Remove some pre-c++17 cruft (#125590)
Summary: C++20 has [eliminated](https://en.cppreference.com/w/cpp/types/result_of) `result_of` in favour of `invoke_result`. It's mysterious that this code even still works, but, nevertheless, I'm fixing it.

Test Plan: Sandcastle

Differential Revision: D56987418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125590
Approved by: https://github.com/Skylion007
2024-05-06 19:19:28 +00:00
daf1eb44bc try to fix the warning in distribute_tensor (#125476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #125475
2024-05-06 18:59:47 +00:00
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec22ddac35b2e47b7e871ef10538d4aee.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
bb668c6468 Compute bounds for the variables created during codegen (#123100)
Before we would just bail out on these bounds for all variables that did
not come from the FX graph. Now we propagate the bounds whenever we have
a rule for that op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123100
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-06 18:12:15 +00:00
3827810453 [export] suggest constant dim values in dynamic shapes fixes (#125458)
[https://www.internalfb.com/diff/D54924742](https://github.com/pytorch/pytorch/pull/121860) allowed specifying integer values for static dims in dynamic shapes. This changes suggested fixes to suggest the actual value instead of the current "None".

Test Plan: existing export tests cover this

Differential Revision: D56921142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125458
Approved by: https://github.com/avikchaudhuri
2024-05-06 17:44:19 +00:00
6ebec38453 Add ciflow/linux-aarch64 to auto labeler on mkldnn PR's (#125599)
Trigger Aarch64 CI on oneDNN changes, to detect issues like this: https://github.com/pytorch/pytorch/issues/125548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125599
Approved by: https://github.com/malfet, https://github.com/snadampal
2024-05-06 17:26:17 +00:00
e30e6d321f [MPS][BE] Introduce MetalShaderLibary class (#125550)
That factors out a repeated pattern of creating a library/fetching a func from source

Typical usecase
```cpp
static MetalShaderLibrary lib(SHADER_SOURCE);
...

id<MTLComputePipelineState> cplState = lib.getPipelieStateForFunc("kernel_name")
```
- Make it possible to use with templated sources
- Add `scalarToMetalTypeString(const Tensor&)` variant to avoid repeated `scalarToMetalTypeString(t.scalar_type())` calls in the code

I.e. it makes no functional changes, but reduces MPS codebase size by 365 lines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125550
Approved by: https://github.com/kulinseth
2024-05-06 17:15:47 +00:00
7bf6ed01ac [inductor] Remove symbol exports in C shim for Windows (#125472)
Summary:
This shim exports symbols on Windows, which can lead to symbol clashes at link time in the following scenario:
1. A DLL imports libtorch
2. A binary imports libtorch, and also depends on the DLL in (1)

Under that scenario, the symbols exported from `shim.h` can clash at link time.

Given that AOTInductor only works for PyTorch2, and PyTorch2 doesn't currently work for Windows, we can work around this problem by simply removing the symbols export on Windows. In the long term, this will need to be figured out when Windows support is added & tested for PyTorch2.

Differential Revision: D56936696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125472
Approved by: https://github.com/desertfire
2024-05-06 14:43:11 +00:00
b6bcd09173 Get rid of tabular and sizes, beef up verbosity of output graph (#125507)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125507
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #125505
2024-05-06 13:41:58 +00:00
71bec453b1 [xla hash update] update the pinned xla hash (#124599)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124599
Approved by: https://github.com/pytorchbot
2024-05-06 12:19:42 +00:00
60efb1060a Make codegen dynamic test faster (#125569)
Let's early exit + avoid an unnecessary split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125569
Approved by: https://github.com/kadeng
2024-05-06 12:09:19 +00:00
24b64fc482 [HOP][inductor] Support pytrees as associative_scan input (#122137)
This allows `associative_scan` to take an arbitrary pytree of tensors,
which is flattened to their leaves before calling the `associative_scan`
higher order operator.

I also add support in inductor to generate code for scanning over sequences
of tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122137
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #119430
2024-05-06 11:29:28 +00:00
68a1f787c8 [inductor][cpp] move some common cpp utils to cpp_utils.py (#125152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125152
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-05-06 04:30:30 +00:00
fc183f0bde [Inductor] Properly package target info for triton.compile (#125553)
Triton updated the interface for `triton.compile` 5162346487

The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.

Re-submitting this after https://github.com/pytorch/pytorch/pull/125241 was reverted due to an unrelated CI issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125553
Approved by: https://github.com/huydhn
2024-05-06 01:36:36 +00:00
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes and see what breaks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
ccbac091d2 Revert "Add write_record_metadata to PyTorchFileWriter (#125184)"
This reverts commit dd92637f445d2787f83829079276f71b1ad1fc7c.

Reverted https://github.com/pytorch/pytorch/pull/125184 on behalf of https://github.com/izaitsevfb due to breaks internal builds, see D56962076 ([comment](https://github.com/pytorch/pytorch/pull/125184#issuecomment-2094976897))
2024-05-05 22:40:00 +00:00
1b1d593c8c Don't call item() into torch.scalar_tensor uselessly (#125373)
Fixes https://github.com/pytorch/pytorch/issues/125368

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373
Approved by: https://github.com/Skylion007
2024-05-05 22:38:16 +00:00
ecd62746e3 Also pull size/stride info from example_value (#125505)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125505
Approved by: https://github.com/jansel
2024-05-05 22:27:46 +00:00
d1a3271a55 [ez]2->3 shards for asan slow (#125499)
One of the shards has been timing out recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125499
Approved by: https://github.com/huydhn
2024-05-05 21:02:44 +00:00
94c4855e75 [Inductor max autotune] Make autotune_select_algorithm more robust (#124928)
This diff makes sure that a custom exception is thrown when no valid
choices remain during autotuning. This allows to gracefully fall back
to a default choice, even if that default choice has not been passed to
autotune_select_algorithm.

Additionally, this diff handles RuntimeErrors during autotuning gracefully, e.g. the corresponding choice is ignored but it does not lead to the compilation failure of the entire model if a problematic choice is encountered during autotuning.
( An error is being logged, though).

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124928
Approved by: https://github.com/int3
ghstack dependencies: #125406
2024-05-05 20:10:10 +00:00
58d8388ed3 Remove Inductor IRs for legacy functional collectives (#124992)
This PR completely removes the Inductor IR for legacy functional collectives:
- Removed the `CollectiveKernel` hiearchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies propoerly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superceded by `ir._CollectiveKernel`.
- Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives.
- Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-05-05 19:49:58 +00:00
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
30c9fd96f6 [FX] Add missing forbidden mutation methods in immutable collections (#125468)
Add `list.sort`, `list.reverse`, `dict.__ior__`, and `dict.setdefault`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125468
Approved by: https://github.com/Skylion007
2024-05-05 19:30:22 +00:00
7c59720ba7 [comm] Ensure ncclComm is not aborted before checking exception (#124466)
Differential Revision: D56347560

More details in this pytorch issue: https://github.com/pytorch/pytorch/issues/124468

It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```

What can happen is this:

1. dist.destroy_process_group() calls into shutdown() and then calls into abort: b2f6cfd9c0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1095)
2. It'll call ncclCommAbort (not graceful afaict), and also set the ncclAsyncErr_ = ncclSystemError; b2f6cfd9c0/torch/csrc/distributed/c10d/NCCLUtils.hpp (L388).
3. ncclWatchdog thread may not have woken up while all this shutdown process happens. And in shutdown we're not waiting for watchdog thread
4. ProcessGroupNCCL dtor is called. It'll wait for the watchdog thread to join
5. watchdog will check the work's isCompleted() -> then calls checkAndSetException(). Because ncclAsyncError_ was set to ncclSystemError, it'll error out and makes you think it's a nccl error.

So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted

Some more longer term discussion in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124466
Approved by: https://github.com/shuqiangzhang, https://github.com/yoyoyocmu, https://github.com/kwen2501
2024-05-05 18:55:48 +00:00
99e4909677 Remove assertion for cat target_func (#125540)
Summary:
We remove the assertion for target_func being cat.
The reason is that we have multiple flavors of concat, such as
cat/cat.default/cat_slice/cat_slice_cat/...
Assertion here is causing multiple times of false positives.

Test Plan: Removing assertion code only.

Differential Revision: D56971387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125540
Approved by: https://github.com/hl475
2024-05-05 18:17:47 +00:00
650a248d3e Rename is_unspecialized to pass_arg_as_tensor, add comment (#125496)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125496
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419, #125483, #125494
2024-05-05 16:57:50 +00:00
12da7ee58f Don't use wrap_fx_proxy_cls for wrap_symint (#125494)
We use very little of the code in wrap_fx_proxy_cls, so dupe it out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125494
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419, #125483
2024-05-05 16:57:50 +00:00
617e473da5 Split wrap_symint out of wrap_unspecialized_primitive (#125483)
While there are some similarities, they are also quite different (one
handles Numpy numbers while the other handles ints.  I am also going to
add a wrap_symfloat soon which will do even more different behavior.
So split these out for clarity.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125483
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419
2024-05-05 16:57:50 +00:00
10f673541e [Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK (#125406)
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.

This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992

Test Plan:
 - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
 - CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
2024-05-05 15:28:45 +00:00
f70bd71a48 [FSDP2] Computed grad divide factors at runtime (#125484)
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}  + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).

**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
2024-05-05 14:11:33 +00:00
dba689bbfd Revert "[FSDP2] Computed grad divide factors at runtime (#125484)"
This reverts commit 9aa7699185e4ec39077e3046dfd63244dffa9ddb.

Reverted https://github.com/pytorch/pytorch/pull/125484 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I am trying to restore ROCm distributed failures in trunk 9aa7699185 ([comment](https://github.com/pytorch/pytorch/pull/125484#issuecomment-2094646996))
2024-05-05 06:12:01 +00:00
cyy
8a0529e986 [2/2] Remove Caffe2 db and distributed code (#125533)
This PR follows #125092 to remove caffe2/db/* and caffe2/distributed/* .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125533
Approved by: https://github.com/kit1980
2024-05-05 05:10:17 +00:00
7f0c5eb023 Added some more flex attention tests (#125487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125487
Approved by: https://github.com/yanboliang
2024-05-04 23:42:40 +00:00
6d30803d64 Revert "[Inductor] Properly package target info for triton.compile (#125241)"
This reverts commit 8a1af95b0979d85c4fe32a75e797323ad81f298d.

Reverted https://github.com/pytorch/pytorch/pull/125241 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing inductor tests on ROCm 8a1af95b09 ([comment](https://github.com/pytorch/pytorch/pull/125241#issuecomment-2094472886))
2024-05-04 22:28:16 +00:00
084d818e71 Revert "try to fix the warning in distribute_tensor (#125476)"
This reverts commit 2b41e1d6fc05428008875e3cfe8be17184e57491.

Reverted https://github.com/pytorch/pytorch/pull/125476 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125476#issuecomment-2094468740))
2024-05-04 22:25:32 +00:00
a32ad828dc Revert "Don't call item() into torch.scalar_tensor uselessly (#125373)"
This reverts commit 2b4fe183db00db88749f8524f3b4a69ca80da0ec.

Reverted https://github.com/pytorch/pytorch/pull/125373 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but there are real failures on the PR that sneak in during the log classifier outage ([comment](https://github.com/pytorch/pytorch/pull/125373#issuecomment-2094464241))
2024-05-04 22:22:36 +00:00
f04c8471a4 [dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421, #124522
2024-05-04 22:08:56 +00:00
5ba777f46e [guards][cpp-guards] Optimize NN module getattr guards (#124522)
Improves the guard overhead of MobileBert model with nn module guards from 92000 units to 20000 units.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124522
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421
2024-05-04 22:08:56 +00:00
76a26a885d Add module tracker (#125352)
This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split, just let me know.
This does:
- Cleanup context manager use in test_flop_counter
- Remove need for mod argument in FlopCounterMode, warning about it
- Re-implement a Module tracker from scratch using global forward Module use and multi_grad_hook (we cannot use global backward Module hook because they don't look for nested Tensor and they're custom Function based instead of multi_grad_hook).
- Update FlopCouterMode to use the new ModuleTracker. All the existing test suite passes as-is (only changes there are new tests and refactoring mentioned above)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
2024-05-04 18:33:35 +00:00
1a20b4ef3f [dynamo] handle inactive nullcontexts across graph breaks (#125518)
whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125518
Approved by: https://github.com/yanboliang
2024-05-04 12:52:20 +00:00
6f70d22277 Extend torch.utils._sympy.symbol for more Inductor symbols (#125419)
I'm still missing a few, cdzq at least

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419
Approved by: https://github.com/lezcano
ghstack dependencies: #125395
2024-05-04 09:05:00 +00:00
5cd7c75bd9 [pipelining] Add tracing frontend (#125448)
This PR allows user to transform a model into a pipeline representation with split stages, according to a split spec.
```
def pipeline(
    module: torch.nn.Module,
    num_chunks: int,
    example_args: Tuple[Any, ...],
    example_kwargs: Optional[Dict[str, Any]] = None,
    split_spec: Optional[Dict[str, SplitPoint]] = None,
    split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None,
) -> Pipe:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448
Approved by: https://github.com/H-Huang
ghstack dependencies: #125273
2024-05-04 09:00:25 +00:00
2b4fe183db Don't call item() into torch.scalar_tensor uselessly (#125373)
Fixes https://github.com/pytorch/pytorch/issues/125368

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125373
Approved by: https://github.com/Skylion007
2024-05-04 08:07:13 +00:00
5ef50d75f8 Don't short circuit if shape is same (#125188)
This is more unbacked SymInt friendly.  If this does not work, my back
up plan is to short circuit only if it is statically known equal.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125188
Approved by: https://github.com/albanD
2024-05-04 07:11:08 +00:00
cyy
83845a7c78 [1/2] Remove caffe2 db and distributed from build system (#125092)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 db, distributed and some binaries have been removed.
To be noted, this was inspired and is co-dev with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125092
Approved by: https://github.com/malfet
2024-05-04 06:48:46 +00:00
2b41e1d6fc try to fix the warning in distribute_tensor (#125476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125476
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #125475
2024-05-04 05:25:13 +00:00
b62e89c1b8 [dynamo] Do not turn on record relay with TORCH_COMPILE_DEBUG (#125488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125488
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-05-04 05:10:31 +00:00
ff061baa94 [comm_mode] adding some initial c10d ops to CommDebugMode (#125475)
looks like we can make it work :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125475
Approved by: https://github.com/awgu
2024-05-04 04:20:46 +00:00
d4727fd4eb [TD][ez] Better check for is pr or not (#125485)
You can trigger ciflow tags on main branch commits, so we should be more conservative when checking to see if a workflow is a PR/on the main branch.

get_pr_number checks for the pr number based on the PR_NUMBER env var or a tag of the for `ciflow/workflow/pr number`

If we fail to find something like this, then assume it is on the main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
2024-05-04 03:08:44 +00:00
0302dc68bf [Reland] Fakify script object inputs and attributes for non-strict ex… (#125490)
A re-land of #124239.

This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.

Concretely, we

fakify all the script object inputs, and module attributes (gathered by constant_attrs).
patch the module's attributes with fakified script object
right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
2024-05-04 02:39:42 +00:00
bfd5bb0c44 [c10d] only PG0 should dump when monitoring thread timed out (#125356)
Summary:
We found that some dumps are missing when monitoring thread timeout.
This is likely due to multiple PGs could still dump the same records
at the same time. So we should allow only PG0 to actualy dump
Test Plan:
 unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
2024-05-04 00:43:20 +00:00
eqy
d325c55896 Add CUDA paths to CODEOWNERS (#125409)
CC @ptrblck @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125409
Approved by: https://github.com/albanD
2024-05-04 00:29:39 +00:00
8a1af95b09 [Inductor] Properly package target info for triton.compile (#125241)
Triton updated the interface for `triton.compile` 5162346487

The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125241
Approved by: https://github.com/jansel
2024-05-04 00:10:53 +00:00
9aa7699185 [FSDP2] Computed grad divide factors at runtime (#125484)
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}  + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).

**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
2024-05-03 23:44:05 +00:00
996bb74077 [FSDP2] Added HSDP grad acc tests and some minor changes (#125479)
This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125479
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431
2024-05-03 23:44:05 +00:00
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
f2ab96a57e [dynamo] fix crash when context manager is passed to a function (#125321)
Fix https://github.com/pytorch/pytorch/issues/125274. Main change was to reconstruct `ContextWrappingVariables` as objects in general, but we can replace them with the class on the caller side when generating the resume function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125321
Approved by: https://github.com/jansel
2024-05-03 23:01:30 +00:00
59abd1dccb Fix lint after PR 122611 (#125512)
Fix lint after https://github.com/pytorch/pytorch/pull/122611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125512
Approved by: https://github.com/clee2000
2024-05-03 22:58:20 +00:00
4abcf36dde Make c10::Error empty backtrace as an optional argument (#122611)
Summary: Split from the main diff in the stack.

Test Plan: Build validation should be enough.

Reviewed By: ezyang

Differential Revision: D55313410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122611
Approved by: https://github.com/ezyang
2024-05-03 22:50:00 +00:00
a783fef990 [AOTI] Add a missing mypy ignore (#125508)
Summary: Caused by https://github.com/pytorch/pytorch/pull/125397, but somehow was not caught by CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125508
Approved by: https://github.com/izaitsevfb
2024-05-03 22:32:31 +00:00
2b5ae2611e s390x: use runtime detection for vectorization support (#123936)
s390x: use runtime detection for vectorization support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123936
Approved by: https://github.com/malfet, https://github.com/jansel, https://github.com/xuhancn
2024-05-03 21:34:37 +00:00
5503c29357 Introduce torch.utils._sympy.symbol (#125395)
This provides utilities for creating and querying properties on
sympy.Symbol.  I want to use this refactor to get a better handle on how
the 's' prefix is being used in Inductor.  To start, I only do
symbolic_shapes code because that's what I'm familiar with.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395
Approved by: https://github.com/Skylion007
2024-05-03 21:24:23 +00:00
1a578df57c [FSDP2] Added test to show rank 0 broadcast for HSDP replicas (#125431)
This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125431
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-03 21:17:35 +00:00
c941fee7ea [CPP extention] Baton lock is called regardless the code version (#125404)
Greetings!

Fixes #125403

Please assist me with the testing as it is possible for my reproducer to miss the error in the code. Several (at least two) threads should enter the same part of the code at the same time to check file lock is actually working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125404
Approved by: https://github.com/ezyang
2024-05-03 21:10:39 +00:00
645baef05d s390x: remove workaround for sleef issue (#124730)
This workaround is no longer needed since sleef was updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124730
Approved by: https://github.com/soulitzer
2024-05-03 20:52:05 +00:00
b1a7455b99 [Inductor cutlass backend] Fix cutlass_utils.get_max_alignment() for strided layouts. (#124930)
Fixes cutlass_utils.get_max_alignment() which was so far not checking the alignment properly. Basically
the method so far assumed that the passed layout is contiguous and row-major, which does not have to be true.

Test Plan:
CI - test_cutlass_backend.py to prevent regressions
Added unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124930
Approved by: https://github.com/int3
ghstack dependencies: #124929
2024-05-03 20:50:26 +00:00
a988b4ed76 [AOTI] Generate mul_Scalar instead of mul_Tensor (#125397)
Summary: Fix https://github.com/pytorch/pytorch/issues/117365. When the second argument to aten.mul.Tensor is a scalar (e.g. scale factor), the cpp wrapper expects to generate a call to mul_Scalar when fallback happens (e.g. Complex dtype).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125397
Approved by: https://github.com/chenyang78
ghstack dependencies: #125329
2024-05-03 18:35:42 +00:00
e84a5b6cc0 [AOTI] Add missing std::move for constant args (#125329)
Summary: fix https://github.com/pytorch/pytorch/issues/123187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125329
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-05-03 18:35:42 +00:00
d6052a35d4 [RFC][FSDP2] Added register_fsdp_forward_method for user fwd methods (#125394)
FSDP only runs its pre/post-forward hooks on `nn.Module.forward`. This means that if the user runs a custom method meant as a forward pass, then FSDP will not all-gather the parameters. Examples include HuggingFace models' `generate()` (https://github.com/pytorch/pytorch/issues/123962, https://github.com/pytorch/pytorch/issues/100069) or others (https://github.com/pytorch/pytorch/issues/109385).

This PR adds a monkey patching API `register_fsdp_forward_method(module: nn.Module, method_name: str)` to allow FSDP pre/post-forward hooks to run on the method. The function is a no-op if the passed-in `module` is not an FSDP module so that the register function can be called even if the FSDP wrapping changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125394
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-03 18:31:28 +00:00
52f9128a0d [AMD] Fix cutlass path in inductor (#125463)
Summary: Trunk is broken because fbcode triton-amd doesn't have cutlass path

Test Plan: It now runs.

Differential Revision: D56923833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125463
Approved by: https://github.com/Skylion007
2024-05-03 18:02:58 +00:00
e10b2ba357 Script for compiling count + time of test at file granularity (#125322)
Adds script for compiling # of tests + total time take at the file granularity
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125322
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-05-03 17:35:44 +00:00
12a69afa6d [export] Fix deserializer node meta handling. (#125454)
Summary: The code seems not needed because serializer shouldn't make any meaningful decision about what goes to node metadata.

Test Plan: CI

Differential Revision: D56918543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125454
Approved by: https://github.com/angelayi
2024-05-03 16:51:08 +00:00
30610251ec [MPS] And naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().profiler.isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files, in the current working directory, which can later be loaded and used to debug or profiler the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
a99ada5b27 call super().__post_init__ in ForeachFuncinfo.__post_init__ (#125457)
obviously the current main branch's `ForeachFuncInfo`'s dunder post init doesn't `super().__post_init__()` which does some setup including setting `dtypesIfCUDA` and `dtypesIfROCM`.

Fixes #125295
related: #125001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125457
Approved by: https://github.com/janeyx99
2024-05-03 14:54:15 +00:00
79af814369 [FSDP] Added private _unshard API (#124304)
Some toy example:
<img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd">

We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).

Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
2024-05-03 13:14:15 +00:00
ca98c2a932 inductor: Add Conv3d support (#124361)
This PR is to add Conv3d support in inductor. Basicly reuse and expand Conv2d logic and unit tests to Conv3d.

Conv3d inductor support will improve the performance of C2D_R50, I3D_R50, I3D_R101, Slow and SlowFast-R50 from OOB models.

  | C2D_R50 | I3D_R50 | I3D_R101 | Slow | SlowFast-R50
-- | -- | -- | -- | -- | --
eager | 15.805 | 13.909 | 11.639 | 12.101 | 6.606
Compile w/o conv3d | 17.244 | 14.893 | 12.109 | 13.015 | 6.603
Compile w/ conv3d | 21.212 | 17.707 | 14.974 | 16.130 | 8.537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124361
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-05-03 10:24:14 +00:00
489b4586e9 [optim]fix ut and sgd kernel (#124904)
- Original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs`
 - We should use different `grad_scaler`, they should not share 1 `scale`, there is no issue exposed here because the default `_growth_interval` is 2000 so it will not growth and there is also no inf is found so it will not reduced. The one in `test_cuda.py` should also have this issue,
 - I set a manual seed to reproduce purpose if there is any numerical failure
 - I use Tensor tracker here because we failed this UT in dynamo case, the cpp generated code are not exactly same with fused/non fused kernel.
 - I make it check both `cuda` and `cpu`.
 - I find some SGD numerical issue with `clang`, and fixed it by using `fmadd` instead of `add/mul` in fused sgd veckernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-03 09:13:24 +00:00
bebefcf845 Driver folder check (#117548)
Added extra check for driver folders for Libtorch, as stat struct does not recognize driver folders, so torch.save should work for them as well. (e.g. save model.pt directly under C: )

Fixes [#111121](https://github.com/pytorch/pytorch/issues/111121) and #105488

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117548
Approved by: https://github.com/malfet
2024-05-03 09:10:11 +00:00
e5cc7ada67 skip triton template precompilation in 311.0-3.11.7 to workaround 311 cpython bug (#125446)
Fix for https://github.com/pytorch/pytorch/issues/125374. We dont have CI for this specific versions, but I verified locally. THere is a cpython bug from 3.11.0->3.11.7 where the ast parsing state is global, and errors with multiple threads. when dust settles a little around the new process based compilation we can look into migrating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125446
Approved by: https://github.com/Chillee
ghstack dependencies: #125289
2024-05-03 08:28:32 +00:00
dd92637f44 Add write_record_metadata to PyTorchFileWriter (#125184)
Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that
- writes the zipfile header/end of central directory metadata for an entry*
- reserves `num_bytes` in the zipfile for the payload.

*Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184
Approved by: https://github.com/albanD
2024-05-03 07:29:52 +00:00
4c84789743 [vision hash update] update the pinned vision hash (#123227)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123227
Approved by: https://github.com/pytorchbot
2024-05-03 05:55:29 +00:00
071ee40793 [dynamo][nn module] Check for duplicate tensors in register_attr_or_module (#125421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125421
Approved by: https://github.com/mlazos
ghstack dependencies: #125439
2024-05-03 05:08:09 +00:00
ef757a5c00 [export] use tree_map for _flatten_dynamic_shapes (#125415)
Summary:
Fixing the implementation of `_flatten_dynamic_shapes()`, to follow how `_process_dynamic_shapes()` does it. The previous implementation would misinterpret some nested dynamic shapes specs, causing it to miss out on some shapes specs, for example with nested inputs/constant input tuples:

```
inputs = (
    (2, 1),
    (
        torch.randn(2, 1),
        torch.randn(2, 2),
        torch.randn(2, 3),
    )
)

dynamic_shapes = (
    (None, None),
    (
        None,
        None,
        None,
    )
)
```
This would get interpreted as 2 shapes specs for 2d and 3d tensors. Fixing so this doesn't happen.

Test Plan: Existing export tests

Differential Revision: D56894923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125415
Approved by: https://github.com/angelayi
2024-05-03 04:59:17 +00:00
394ec2da30 Remove GPU Check from Basic Chrome Trace test (#125430)
Summary: Remove the check to make sure all GPU labels are enumerated when CUDA is available. There are some systems where CUDA is available but we do not print any GPU labels (because GPU is not available).

Test Plan: Test in regression with ciflow/periodic label

Differential Revision: D56906893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125430
Approved by: https://github.com/izaitsevfb
2024-05-03 04:51:10 +00:00
8706da2bad [dynamo][cpp-guards] Improve recompilation reason logic for NO_TENSOR_ALIASING guard (#125439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125439
Approved by: https://github.com/williamwen42
2024-05-03 04:49:41 +00:00
d156cb2e12 Fix mem size mismatch from split/chunk in const folding (#125199)
Summary:
The chunk/split ops on the weights/constants is folded in a fx pass and each output tensor has the same storage size of the original tensor (which is 3x of its actual size if chunk(3)). However Backend calculates the mem size on device from tensor shape/stride/dtype. This causes the mismatch  when copying weights/constants to device as allocated mem on device is always smaller than the size of weights/constants and results in a runtime error in loading weight/constant (T172125529).

This diff fixes the issue by cloning the tensors after const folding so that the tensors has correct storage size.

Test Plan:
Before this change: (18432 = 48 * 64 * 2 * 3)
 ```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```

```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s

OK
```

Reviewed By: jfix71

Differential Revision: D56663931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125199
Approved by: https://github.com/jfix71
2024-05-03 04:42:38 +00:00
a40d6df448 [MPS] Native nonzero implementation (#125355)
Fixes https://github.com/pytorch/pytorch/issues/124850

Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fallback to CPU (previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355
Approved by: https://github.com/kulinseth
2024-05-03 03:50:58 +00:00
e15da7856c [MPS] Fix overflow in cumsum when dtype is bool (#125318)
`cumsum` and `cumprod` was (is?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```

In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.

Fixes #96614 #106112 #109166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
2024-05-03 01:19:24 +00:00
acac7aa70f [CI] Unskip Linalg tests on ARM (#125377)
Removes obscure "Issue with numpy version on arm" added by https://github.com/pytorch/pytorch/pull/82213
And replaces it with 4 targeted skips:
 - test_addmv for `float16`
- test_vector_norm for `float16`, `bfloat16` and `float32`

Followups to fix them are tracked in https://github.com/pytorch/pytorch/issues/125438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125377
Approved by: https://github.com/kit1980
2024-05-03 01:18:52 +00:00
d18a6f46d0 Adding Compare in torch.utils.benchmark documentation (#125009)
`torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation.

I think this is a valuable resource to add since it can help people embracing the torch benchmark way of doing things, and help people building documentation towards it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
2024-05-03 00:50:54 +00:00
4440d0755a Support custom layout call under torch dispatch mode (#125379)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125379
Approved by: https://github.com/jbschlosser
2024-05-02 23:44:12 +00:00
7551755cec Update tolerance for flex fp32 (#125444)
# Summary
Updates the tolerances to account for internal failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125444
Approved by: https://github.com/kit1980
2024-05-02 23:34:18 +00:00
3b5f6b10ad [Inductor] default block size for head_dim = 256 for flex attention (#125380)
## H100
### torch.bfloat16
No major change, as expected.
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.122 |              |             |             |             |            |             |                |
| Max     |     1.437 |            1 |          16 |         512 |         512 |        128 | head_bias   | torch.bfloat16 |
| Min     |     0.895 |            1 |          16 |        1024 |        1024 |         64 | head_bias   | torch.bfloat16 |
```
### torch.float32
Before: OOM when ```head_dim``` = 256
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.231 |              |             |             |             |            |             |               |
| Max     |     3.760 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.532 |            1 |          16 |         512 |         512 |        256 | causal_mask | torch.float32 |
```

## A100
### torch.bfloat16
Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     0.587 |              |             |             |             |            |               |                |
| Max     |     0.960 |            1 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.017 |            8 |          16 |        4096 |        4096 |        256 | relative_bias | torch.bfloat16 |
```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.756 |              |             |             |             |            |             |                |
| Max     |     0.931 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.467 |           16 |          16 |        1024 |        1024 |        256 | noop        | torch.bfloat16 |
```

### torch.float32
Before: OOM when ```head_dim``` = 256
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.386 |              |             |             |             |            |             |               |
| Max     |     7.584 |           16 |          16 |         512 |         512 |         64 | noop        | torch.float32 |
| Min     |     0.948 |            1 |          16 |         512 |         512 |        256 | causal_mask | torch.float32 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125380
Approved by: https://github.com/drisspg
2024-05-02 22:51:07 +00:00
5c7b71dccf [DCP] Adds strict option to DefaultPlanner (#123869)
~Users may have custom use cases for the `strict` parameter in load. In my mind, if we automatically call `state_dict` and `load_state_dict` in save/load, we need to support the same functionality in `nn.Modules`.~

It turns out this is actually not related to nn.Module's strict param. Since `state_dict` is called inside `dcp.load`, it's actually impossible to create a model such that the following would raise an error:
```
state_dict = module.state_dict()
module.load_state_dict(state_dict, strict=True)
```

The issue is actually just when there are elements in `state_dict` which do not exist in the checkpoint. This PR adds the ability to configure this behavior through the DefaultSavePlanner (see tests).

Concretely, if module has extra attributes not present in the checkpoint, we will only raise an error if `DefaultLoadPlanner.allow_partial_load==False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123869
Approved by: https://github.com/fegin
2024-05-02 22:50:32 +00:00
2c8237c6aa [ATen-VK] Resolve compiler_flags to allow Mac build (#125361)
Summary:
## `-Wmissing-prototypes`

In ATen-Vulkan, we often define functions in `.cpp` files without declaring them in `.h` files without hiding them in an anonymous namespace.

Example: [`Packing.cpp`'s channel_image_repacking()](f1f142c44f/aten/src/ATen/native/vulkan/impl/Packing.cpp (L299-L348))

On Mac, this results in a `-Wmissing-prototypes` warning, which is disabled in this change.

## `-Wshadow`

In `Adapter.cpp`, we overwrite a variable called `properties`, which we fix in this change as opposed to disabling the warning.

Test Plan: CI

Differential Revision: D56850324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125361
Approved by: https://github.com/SS-JIA
2024-05-02 22:26:39 +00:00
55c705b602 [dynamo] add trace_bytecode logging artifact (#125360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125360
Approved by: https://github.com/ezyang
2024-05-02 22:01:00 +00:00
a0e2f62edd Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit 9e24c263f998819f849bb8293323213101e9aefc.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))
2024-05-02 21:36:18 +00:00
b1b03992d0 Merge the pyi files into py files of optimizer (#125153)
Merge the interfaces in pyi files into py files in `torch/optim`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125153
Approved by: https://github.com/janeyx99
2024-05-02 21:29:31 +00:00
edad82fc90 Add private helper for determining which version of FA2 closest matches kernel version (#123653)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123653
Approved by: https://github.com/mikaylagawarecki
2024-05-02 21:28:23 +00:00
0199ce8d6c [pipelining] Add microbatch split and merge utils (#125273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776, #124875, #124958
2024-05-02 21:09:47 +00:00
1657f7e262 [Doc] Update docstrings for torch/random.py (#125265)
Updates the docstrings for torch/random.py to clarify what device / RNG each function operates on.

While trying to understand the difference between
```
state = torch.random.get_rng_state()
some_code
torch.random.set_rng_state(state)
```
and
```
with torch.random.fork_rng():
    some_code
```
I found out that there was a note about this in the docstring that wasn't being rendered on the website. I fixed that note and added additional clarifications on other functions in this file.

Test Plan:
Built the docs and verified that everything renders correctly.

<img width="911" alt="Screenshot 2024-04-30 at 2 22 08 PM" src="https://github.com/pytorch/pytorch/assets/9263852/f219bc35-89bd-4f5b-ba60-255b089499a4">

<img width="901" alt="Screenshot 2024-04-30 at 2 22 13 PM" src="https://github.com/pytorch/pytorch/assets/9263852/c141e7fa-afc9-4c66-b460-96668ce35606">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125265
Approved by: https://github.com/Balandat, https://github.com/lezcano
2024-05-02 20:55:23 +00:00
fc76764a56 Always pass down kernel_file and grid as string (#125384)
From my test with Ads production workload, I found sometime kernel_file is None and grid is a tuple. It will crash since ExecutionTraceObserver expects string for both kernel_file and grid. This PR is to make sure  kernel_file and grid are always passed down as string. Need to find the root cause why kernel_file is none.

Unit test:
     buck test  @mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125384
Approved by: https://github.com/davidberard98, https://github.com/sraikund16
2024-05-02 20:43:20 +00:00
dae574c713 Don't make replacements for i variables (#125398)
This was introduced in https://github.com/pytorch/pytorch/pull/110262
but actually it looks like they were trying to hit unbacked SymInt.
Now that unbacked SymInt is renamed to u, this code is no longer
necessary

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125398
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2024-05-02 20:38:09 +00:00
4f62494bf9 [DCP] Move async logic into filesystem for better encapsulation (#124944)
This logic is specific to FilesystemWriter, and now has a better place to live due to the new AsyncStager class

Differential Revision: [D56578436](https://our.internmc.facebook.com/intern/diff/D56578436/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124944
Approved by: https://github.com/fegin
ghstack dependencies: #122965, #124939
2024-05-02 20:31:33 +00:00
2bbfb70831 ignore unsupported module from flop counter (#125346)
Summary:
Torchscript modules do not support forward hooks and thus can't work with flop_counter context manager for hierarchical output by passing a module to FlopCounterMode on construction.

Currently any module that includes a script module causes an exception to be thrown so adding a try/catch to ignore any script modules for forward hooks.

Test Plan: CI Signals

Differential Revision: D56850661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125346
Approved by: https://github.com/842974287
2024-05-02 20:30:52 +00:00
799f1460af [DCP] Provides default AsyncStager (#124939)
Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939
Approved by: https://github.com/fegin
ghstack dependencies: #122965
2024-05-02 19:48:54 +00:00
3741fb3680 [DCP] Introduce async staging extension points (#122965)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #124944
* #124939
* __->__ #122965

Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)

*This PR is now ready for merge and is not an RFC*

Major choices are:
-- the introduction of the AsyncStager protocol
-- removed `executor` from param.
-- leave async as a separate method (for now)

This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap d2h with and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a  thread to avoid GIL issues.

A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```
def custom_async_save(...):
    #     this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter() # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)

    final_storage_writer = FileSystemWriter()
    mp.spawn(      # issue2 is gone, do whatever you want here
	dcp.save,    # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,   # lot's of way's to do this, not really the most important part
	checkpoint_id=checkpoint_id,
	storage_writer=storage_writer,
	planner=planner,
	process_group=process_group, # this actually wouldn't work, but again not the pt.
      )
      # leaving out the rest of the details for managing your extra special subprocess.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
2024-05-02 19:01:55 +00:00
da991fac22 [ROCm][CI] upgrade CI to ROCm 6.1 (#124300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124300
Approved by: https://github.com/malfet
2024-05-02 17:16:02 +00:00
1eb7b8eb60 [PT2D] Ensure the trace rules are correct with distributed (#125333)
Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This won't do anything if rules are not evluated before PG initilization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125333
Approved by: https://github.com/yanboliang
2024-05-02 16:28:38 +00:00
e93b57a570 Add propagate_real_tensors mode for unbacked (#125115)
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.

This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.

I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:

```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```

Potential later follow ups:

* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
2024-05-02 15:28:26 +00:00
fb1bfe1156 Get cutlass_library import working under fbcode (#125257)
Differential Revision: D56764089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125257
Approved by: https://github.com/chenyang78
2024-05-02 15:17:10 +00:00
8046de3512 [Inductor cutlass backend] Remove epilogue nodes from Kernel call (#124929)
Minor refactoring:
Remove unused "fused epilogue node" arguments from some method  Kernel call signatures.

Test Plan:
Covered by current tests in test_cutlass_backend.py - no functional change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124929
Approved by: https://github.com/eellison
2024-05-02 13:02:31 +00:00
a13a0a2479 [dynamo][easy] Simple fixes to prepare for nn module guards (#125316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125316
Approved by: https://github.com/williamwen42
ghstack dependencies: #125275
2024-05-02 12:08:11 +00:00
0b70026d3b Do not pass none to has_pending_mutation (#125359)
#fix https://github.com/pytorch/pytorch/issues/125315

Several failures when inlining nn module is enabled are due to passing None to has_pending_mutation
from previous code, it sounds like its expected for variable to be none when not found, In that case we should skip it and not call has_pending_mutation
this is tested in https://github.com/pytorch/pytorch/pull/125354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125359
Approved by: https://github.com/mlazos
2024-05-02 09:08:22 +00:00
aa7be72cc5 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99, https://github.com/jeffdaily
2024-05-02 04:19:09 +00:00
da5d2d9b3e Hotfix: restore CPP guard string in structured trace (#125303)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125303
Approved by: https://github.com/albanD
2024-05-02 03:57:19 +00:00
fff7a31800 fix torchdeploy issue on sharddim_alltoall op (#125344)
Summary: fix torchdeploy issues when registering the distributed op, similar to what functional collective did

Differential Revision: D56850434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125344
Approved by: https://github.com/XilunWu, https://github.com/fegin
2024-05-02 03:38:34 +00:00
f59ce798f9 [ROCm] TunableOp for scaled_mm (#123987)
Adds a new ScaledGemmTunableOp implementation using hipblaslt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123987
Approved by: https://github.com/jianyuh
2024-05-02 03:06:31 +00:00
5ea54839c9 Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125301
Approved by: https://github.com/albanD
2024-05-02 02:39:58 +00:00
b119e1bcc2 Fix refcount handling for dtype, layout and memory format (#125271)
Finish fixing https://github.com/pytorch/pytorch/issues/124868
re-use our wrap() utils as much as possible and NewRef in other places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125271
Approved by: https://github.com/colesbury
2024-05-02 02:34:34 +00:00
4731130ea8 Add a code comment about torch._check_is_size in tensor_split (#125292)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125292
Approved by: https://github.com/albanD
2024-05-02 02:25:38 +00:00
a9309502af Revert "Refactoring to remove unused variable (#125252)"
This reverts commit b094622bc954e179ddb8649652b87d2a81d7d500.

Reverted https://github.com/pytorch/pytorch/pull/125252 on behalf of https://github.com/drisspg due to going to land codev ([comment](https://github.com/pytorch/pytorch/pull/125252#issuecomment-2089394606))
2024-05-02 01:49:57 +00:00
b03fb49ed8 Revert "[dynamo] use lazy disable dynamo for manual seed (#125196)"
This reverts commit 8320b770fd9dc4671bc9eb0d535e14173e95cf45.

Reverted https://github.com/pytorch/pytorch/pull/125196 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/125196#issuecomment-2089355842))
2024-05-02 00:57:39 +00:00
9e24c263f9 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
2024-05-01 23:58:35 +00:00
f1f142c44f Revert "Fakify script object inputs and attributes for non-strict export (#124239)"
This reverts commit ecc2e034f7e55bf9ff7f4e5df4e9086a5c92caaa.

Reverted https://github.com/pytorch/pytorch/pull/124239 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124239#issuecomment-2089305447))
2024-05-01 23:56:00 +00:00
9022f131b5 [inductor] switch assume_aligned_inputs to False (#124336)
In #123319, we guard some behavior behind the `assume_aligned_inputs` config option. If we set this to `False`, then the behavior added in #123319 becomes the default behavior. See the referenced PR for more details about the behavior affected.

Side effects:
* It's possible that this will hurt performance in some scenarios. For example, if an unaligned input is used in a matmul, it might be better to perform the clone to align it first.
* This will occasionally cause recompiles. Specifically: the check we perform (`(storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0`) can be guarded on if the storage_offset becomes dynamic. storage_offset becomes dynamic during automatic_dynamic_shapes after a shape or stride changes. Previously, this was increasing graph breaks in cpu inductor torchbench tests (but is fixed by more carefully guarding checks on alignment, so that we don't run them and generate guards unless actually needed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124336
Approved by: https://github.com/eellison
2024-05-01 23:49:27 +00:00
c281d3a0cb Enable UFMT on test_indexing&test_view_ops (#125112)
Part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125112
Approved by: https://github.com/ezyang
2024-05-01 23:44:53 +00:00
9043ccafdf Require nnz==0 in sparse meta tensors (#125221)
As in the title and per discussion starting at https://github.com/pytorch/pytorch/pull/117907#issuecomment-2082426468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125221
Approved by: https://github.com/amjames, https://github.com/ezyang
2024-05-01 23:41:49 +00:00
46f326eff5 explicitly reset stderr/stdout in precompilation (#125289)
I was seeing a weird bug where after running max-autotune my stdout would be misdirected. other people have not been able to repro this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125289
Approved by: https://github.com/shunting314, https://github.com/mlazos
2024-05-01 23:41:36 +00:00
6f5f405b05 [ncclx] Rename NCCL-EXP to NCCLX (#125238)
Reviewed By: kryanchun

Differential Revision: D56534548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125238
Approved by: https://github.com/kwen2501
2024-05-01 23:29:55 +00:00
6cfb55dd5d Add a variable for some testcases. (#124708)
Some testcases can use 'TEST_PRIVATEUSE1_DEVICE_TYPE' to make adapting these testcases on others device more convenient.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124708
Approved by: https://github.com/albanD
2024-05-01 23:19:12 +00:00
c451d108da Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Addresses issue #124518, adds isin_Tensor_Tensor_out.

Tests added to test_mps.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-05-01 23:14:05 +00:00
506eda538b Fix windows build error not propagating (#125306)
* Fixes https://github.com/pytorch/pytorch/issues/124886
* Kind of similar to https://github.com/pytorch/pytorch/pull/109393

I think what happens is `exit` and `exit /b` propagate the errorlevel correctly, but `exit /b` only exists the currently running batch script and not the entire cmd.exe (or whatever program is running the batch script), so `exit /b` exits with errorlevel 1, but the the parent cmd exits with 0, and bash sees cmd's 0

I think `goto fail` and `exit` are the same thing when the batch script is run from a bash script so either would work in this case?  But the `goto fail` method might be better if someone happens to run the script on cmdline

I assumed that anywhere anyone was exiting after checking the error code, they did want to exit completely, and I'm pretty sure that being inside a parenthesis counts as being a different script, so I changed everything to goto fail just in case, this might be too aggressive?

Logs after this change for a build failure on cuda:
https://github.com/pytorch/pytorch/actions/runs/8912185834/job/24475087535?pr=125306
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7599/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+git3171c11
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+git3171c11 -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8

(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 goto fail

(base) C:\actions-runner\_work\pytorch\pytorch>exit /b 1
Error: Process completed with exit code 1.
```

vs original
https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7604/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+gite09f98c
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+gite09f98c -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8

(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 exit /b
+ assert_git_not_dirty
+ [[ win-vs2019-cuda11.8-py3 != *rocm* ]]
+ [[ win-vs2019-cuda11.8-py3 != *xla* ]]
++ git status --porcelain
++ grep -v '?? third_party'
++ true
+ git_status=
+ [[ -n '' ]]
+ echo 'BUILD PASSED'
BUILD PASSED
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125306
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/atalman
2024-05-01 22:06:47 +00:00
599a2e25f1 Reland "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347)" (#125288)
Re-land of https://github.com/pytorch/pytorch/pull/123347.

The original PR broke internal because of a circular import due to importing dynamo in the DTensor code. The new version uses `torch._dynamo_disable` to work around

This reverts commit 9d88339b535f57cd0e2926c9ac4c2542e4490aac.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125288
Approved by: https://github.com/ezyang, https://github.com/yanboliang, https://github.com/yoyoyocmu, https://github.com/anijain2305, https://github.com/fegin
ghstack dependencies: #124398, #124399, #124400
2024-05-01 21:56:01 +00:00
9e9ba61fde AOTAutograd: force tangents to be contiguous when subclass inner tensor is noncontiguous (#124400)
Fixes https://github.com/pytorch/pytorch/issues/124397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124400
Approved by: https://github.com/ezyang, https://github.com/yoyoyocmu
ghstack dependencies: #124398, #124399
2024-05-01 21:56:01 +00:00
5173cbe260 fix FakeTensor creation on noncontiguous subclasses (#124399)
Fixes https://github.com/pytorch/pytorch/issues/125287

Fixes https://github.com/pytorch/pytorch/issues/124090, context on the issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124399
Approved by: https://github.com/soulitzer
ghstack dependencies: #124398
2024-05-01 21:56:01 +00:00
7058563078 support as_python_constant on PlacementClassVariable (#124398)
Fixes an error for torchtitan + internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124398
Approved by: https://github.com/ezyang, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-01 21:56:01 +00:00
2d794bcb8a Delete NegateSource handling, I think it's dead (#125311)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125311
Approved by: https://github.com/Skylion007
2024-05-01 21:36:50 +00:00
746da8755c switch tests from constrain_as* to torch._check* (#125253)
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.

Differential Revision: D56774333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
2024-05-01 21:01:27 +00:00
dbcf123105 Upgrade submodule oneDNN to v3.4 (#122472)
## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode| Precision | New | Baseline | New/Baseline
-- | -- | -- | -- | -- | --
bert-large | accuracy | fp32 | 93.15325 | 93.15325 | 100.00%
bert-large | accuracy | bf16 | 93.20125 | 93.20125 | 100.00%
bert-large | accuracy | int8 | 92.66641 | 92.66641 | 100.00%
LCM | accuracy | fp32 | 44.11152 | 44.11154 | 100.00%
LCM | accuracy | bf16 | 43.57667 | 43.65096 | 100.17%
ViT | accuracy | fp32 | 0.8033 | 0.8033 | 100.00%
ViT | accuracy | bf16 | 0.8031 | 0.8031 | 100.00%
ViT | accuracy | int8 | 0.7985 | 0.7985 | 100.00%
yolov7 | accuracy | fp32 | 0.512 | 0.512 | 100.00%
yolov7 | accuracy | bf16 | 0.504 | 0.504 | 100.00%
yolov7 | accuracy | int8 | 0.507 | 0.507 | 100.00%
bert-large | realtime | fp32 | 37.433 | 39.136 | 95.65%
bert-large | realtime | bf16 | 166.592 | 160.134 | 104.03%
bert-large | realtime | int8 | 230.876 | 222.594 | 103.72%
ViT | realtime | fp32 | 288.19 | 282.05 | 102.18%
ViT | realtime | bf16 | 755.42 | 741.1 | 101.93%
ViT | realtime | int8 | 1060.94 | 1092.47 | 97.11%
yolov7 | realtime | fp32 | 17.06927 | 16.47995 | 103.58%
yolov7 | realtime | bf16 | 54.68561 | 54.00723 | 101.26%
yolov7 | realtime | int8 | 78.38271 | 77.63214 | 100.97%
bert-large | throughput | fp32 | 47.142 | 47.341 | 99.58%
bert-large | throughput | bf16 | 200.365 | 200.806 | 99.78%
bert-large | throughput | int8 | 144.999 | 145.295 | 99.80%
LCM | throughput | fp32 | 0.54913 | 0.54897 | 100.03%
LCM | throughput | bf16 | 1.062417 | 1.07772 | 98.58%
stable-diffusion | throughput | fp32 | 0.03301 | 0.0331 | 99.73%
stable-diffusion | throughput | bf16 | 0.08773 | 0.08849 | 99.14%
stable-diffusion | throughput | int8 | 0.0491 | 0.05024 | 97.73%
ViT | throughput | fp32 | 342.55 | 346.47 | 98.87%
ViT | throughput | bf16 | 1263.4 | 1268.32 | 99.61%
ViT | throughput | int8 | 1331.3 | 1345.32 | 98.96%
yolov7 | throughput | fp32 | 115.313 | 115.612 | 99.74%
yolov7 | throughput | bf16 | 323.364 | 323.747 | 99.88%
yolov7 | throughput | int8 | 388.137 | 384.236 | 101.02%
bert-large | train_phase1 | fp32 | 34.223 | 34.309 | 99.75%
bert-large | train_phase1 | bf16 | 90.372 | 88.453 | 102.17%
bert-large | train_phase2 | fp32 | 7.307 | 7.318 | 99.85%

Data Type | Geomean
-- | --
fp32 | 99.88%
bf16 | 100.70%
int8 | 99.88%
all | 100.16%

2. Torchbench cpu userbenchmark inference & training

Test suite | Geomean Ratio (New/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 0.99x
jit_llga_throughtput_fp32 | 1.01x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.00x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization (static & dynamic) accuracy & performance

Config | Performance geomean ratio (New/baseline) | Accuracy ratio (New/baseline)
-- | -- | --
Static quant PTQ | 0.99x | 1.00x
Static quant PTQ_CPP_WRAPPER | 0.98x | 1.00x
Static quant QAT | 0.99x | 1.00x
Dynamic quant PTQ | 1.00x | 1.00x

4. Dynamo benchmarks

Precision | Shape | Wrapper | Thread | Ratio   old/new GEOMEAN | Ratio   old/new GEOMEAN
-- | -- | -- | -- | -- | --
  |   |   |   | Eager | Inductor
Float32 | Static | Default | Multiple | 0.998776 | 1.002091
  |   |   | Single | 1.014086 | 1.01054
Float32 | Dynamic | Default | Multiple | 1.00386 | 1.005975
  |   |   | Single | 1.011036 | 1.008317
AMP | Static | Default | Multiple | 0.996965 | 1.005117
  |   |   | Single | 1.00092 | 0.995666
AMP | Dynamic | Default | Multiple | 0.9959 | 0.995048
  |   |   | Single | 1.002569 | 0.994085

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122472
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/atalman
2024-05-01 20:59:17 +00:00
c99617706e Add lintrunner as dev dependency (#125304)
As per title. We expect people to use it before pushing any PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125304
Approved by: https://github.com/Skylion007
2024-05-01 20:08:03 +00:00
197612c84c ProcessGroupWrapper support custom backend (#124447)
Fixes #ISSUE_NUMBER
In current code, ProcessGroupWrapper works only for `GLOO, NCCL, UCC` when `TORCH_DISTRIBUTED_DEBUG=DETAIL`.
I read the ProcessGroupWrapper code,find that communication_op in ProcessGroupWrapper is just communication_op in origin_backend + runCollectiveChecks in gloo, like allreduce:
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L406-L411)

`runCollectiveChecks` is used to `collective finger print` for tensors and run gloo's `monitoredBarrier`.
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L586-L590)
I dont know why ProcessGroupWrapper doesn't work for all backend, but I think custom backend can support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124447
Approved by: https://github.com/kwen2501
2024-05-01 19:59:55 +00:00
b4ccc615cd Do exact type match on int so we don't pick up bool here too (#125305)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125305
Approved by: https://github.com/Skylion007
2024-05-01 19:46:36 +00:00
a216d87c6b [export] Fix for unflattening modules with duplicate tensors (#125192)
In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However when we trace, they all end up pointing to one state dict name, (ex. `norm.2`).
```
graph():
    %p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight]
    %p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias]
    %p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight]
    %p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias]
    %p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight]
    %p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias]
    %input_ : [num_users=1] = placeholder[target=input_]
    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {})
    %native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {})
    return (getitem_6,)
```
This causes an error in the unflattener where after constructing the submodules for `norm.0`, it will have the graph pointing to `norm.2.weight` and `norm.2.bias`:
```
graph():
    %p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias]
    %p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight]
    %input_ : [num_users=1] = placeholder[target=input_]
    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    return getitem
```
Since the attributes are not within the same scope of the graph, (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error.

So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192
Approved by: https://github.com/zhxchen17
2024-05-01 19:12:50 +00:00
af67704dcc [privateuse1] _refs.masked_fill support privateuse1 when value.device.type is cpu (#124835)
_refs.masked_fill support privateuse1 when value.device.type is cpu.

1. maybe I should consider whether this modification meets the expectations of other privateuse1 devices,
2. add TestCase

Fixes #124693

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124835
Approved by: https://github.com/albanD
2024-05-01 18:57:14 +00:00
07422fd0b9 add missing space to first cmake append (#125294)
the first append not having a space incorrectly merges it to any previous arguments, like `-allow-unsupported-compiler` in my case which results in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'`

full log:
```
python setup.py develop
Building wheel torch-2.4.0a0+git75fa54a
-- Building version 2.4.0a0+git75fa54a
cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch
-- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check
-- Determined _GLIBCXX_USE_CXX11_ABI=1
-- Current compiler supports avx2 extension. Will build perfkernels.
-- Current compiler supports avx512f extension. Will build fbgemm.
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - failed
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message):
  The CUDA compiler

    "/usr/local/cuda-12/bin/nvcc"

  is not able to compile a simple test program.
  It fails with the following output:
    Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl'

    Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207
    [1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler   -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all  "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
    FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o
    /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler   -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all  "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
    gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'
    ninja: build stopped: subcommand failed.

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  cmake/public/cuda.cmake:47 (enable_language)
  cmake/Dependencies.cmake:44 (include)
  CMakeLists.txt:758 (include)

-- Configuring incomplete, errors occurred!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294
Approved by: https://github.com/albanD
2024-05-01 18:35:54 +00:00
bf6acf9add [ROCm] Add extra cuda_to_hip_mappings.py (#125108)
Adding extra mappings discovered when hipifying the backward CUDA kernel of the Mamba model (https://github.com/state-spaces/mamba/).

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125108
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2024-05-01 18:31:02 +00:00
c8d2a55273 Intel GPU: specify the tolerance for torchbench models (#125213)
We encountered some model accuracy failures as the tolerance is critical. In general, we align with CUDA practice. This PR intends to adjust the tolerance for Torchbench models for training mode on Intel GPU devices and aligns with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213
Approved by: https://github.com/desertfire
2024-05-01 17:45:15 +00:00
e3627d05e7 [CMake] Add NVPL BLAS/LAPACK option (#125268)
This PR add a [NVPL](https://docs.nvidia.com/nvpl/introduction.html) BLAS/LAPACK option to CMake for `aarch64` (ARM) machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125268
Approved by: https://github.com/albanD
2024-05-01 17:26:28 +00:00
39eb5d4fa4 Add Sanity Testing to Pytorch Profiler (#124773)
Summary: In the recent weeks, we have encountered bugs in both the normal synchronous trace and on-demand tracing. This diff on its own does sanity checking to make sure the profiler does not have spans that extend past the boundaries that we expect. It also checks some basic properties of the tracings we expect to see. Right now the sanity tests check some basic properties to make sure that the tracings are not completely broken. Requests/suggestions for other properties are welcome.

Test Plan: Run the tests in OSS and Buck

Reviewed By: aaronenyeshi

Differential Revision: D56374298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124773
Approved by: https://github.com/aaronenyeshi
2024-05-01 16:59:35 +00:00
4d410155b2 Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809)"
This reverts commit e09f98c705e4851414cd8ddf21949177af2b13aa.

Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to windows build failure is real, https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line, ignore the statement saying build passed, batch is errorcodes arent propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))
2024-05-01 16:02:02 +00:00
e16f1ee4cc [ez][CI] Move test_modules and test_schema_check off CI_SERIAL_LIST (#125193)
* Related https://github.com/pytorch/pytorch/pull/124085

As in title, move test_modules and test_schema_check off CI_SERIAL_LIST
If things fail, they can get the serialTest decorator instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125193
Approved by: https://github.com/huydhn
2024-05-01 15:48:48 +00:00
8fde9a988c CI: Extending unit test coverage for aarch64 linux (#125255)
Adding core, dynamo and inductor unit tests for aarch64 linux CI runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125255
Approved by: https://github.com/malfet, https://github.com/atalman
2024-05-01 15:37:52 +00:00
b094622bc9 Refactoring to remove unused variable (#125252)
Summary: Removed unused variable for running encoder

Test Plan: buck test //caffe2/test:transformers

Differential Revision: D56771972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125252
Approved by: https://github.com/drisspg
2024-05-01 15:17:45 +00:00
e09f98c705 Include support for the scatter gather cuda kernels to allow for comp… (#124809)
Fixes #121965

This PR hopes to add support complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated.

C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.

Please keep the following in mind:
1) I think this is my first time using Pytorch.
2) This is my first contribution to Pytorch.

Environment:
3080 & WSL 2. `nvcc` is at 12.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-05-01 14:31:31 +00:00
e421f1b4a8 docs: torch.nn.utils.rnn: docs improve (#123559)
docs: `torch.nn.utils.rnn`: docs improve
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123559
Approved by: https://github.com/mikaylagawarecki
2024-05-01 14:27:37 +00:00
a2715144c3 Add NEON-accelerated int8mm for bfloat16 (#125290)
As apparently `vshlq_u32` is faster than `vcvt_f32_f16`

Refactor NEON `tinygemm_kernel` to rely on `load_as_float32x4` and `load_as_float32x4x2` and implement them for float16 (using vcvt), bfloat16 (using left shift) and plain float32 (not using anything)

As result stories110M run at 60 tokens/sec with f16, but at 66 tokens/sec with bf16 and  75 tokens/sec with f32, though more bandwith demand starts to favor reduced floating types as model size gets bigger.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125290
Approved by: https://github.com/mikekgfb
2024-05-01 14:04:49 +00:00
9fbb4dfc12 Fix AttributeError when doing mock patch for FileTimerServerTest.test_expired_timers (#125144)
Fix the patch failure, and we should patch the function where it is used, not where it is defined.
Failure info:
```bash
root@cambricon-PowerEdge-C4140:/workspace# python file_based_timer_test.py -k test_expired_timers
/opt/conda/lib/python3.10/site-packages/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  return torch.library.impl_abstract(qualname, func, _stacklevel=2)
E
======================================================================
ERROR: test_expired_timers (__main__.FileTimerServerTest)
tests that a single expired timer on a process should terminate
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2757, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1376, in patched
    with self.decoration_helper(patched,
  File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1358, in decoration_helper
    arg = exit_stack.enter_context(patching)
  File "/opt/conda/lib/python3.10/contextlib.py", line 492, in enter_context
    result = _cm_type.__enter__(cm)
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1447, in __enter__
    original, local = self.get_original()
  File "/opt/conda/lib/python3.10/unittest/mock.py", line 1420, in get_original
    raise AttributeError(
AttributeError: <module 'torch.distributed.elastic.timer' from '/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/timer/__init__.py'> does not have the attribute 'log_debug_info_for_expired_timers'

To execute this test, run the following from the base repo dir:
     python file_based_timer_test.py -k test_expired_timers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.792s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125144
Approved by: https://github.com/gag1jain
2024-05-01 12:08:04 +00:00
47ba7a76e2 [ATen][CUDA][AMP] Fix dtype mismatch in linalg_vector_norm (#125175)
Fixes #125174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125175
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-01 10:57:12 +00:00
c59cce38a9 [MacOS][CPUInductor] Fix includes to system Python (#125285)
On MacOS 14.4, system Python is configured to point to a non-existing include dir
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))"
/Library/Python/3.9/include
```

Workaround the issue by composing path to include folder from `stlib` config, which points to
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))"
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285
Approved by: https://github.com/kit1980
2024-05-01 10:39:13 +00:00
52142192d4 [pipelining] Add stage backward function (#124958)
This is a helper function which:
1. computes the gradients for the stage inputs, and
2. accumulates gradients for the stage module's parameters.

A unit test for this function is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958
Approved by: https://github.com/wconstab
ghstack dependencies: #124776, #124875
2024-05-01 07:56:58 +00:00
aead440c62 [Inductor] Further tune block size for templated attention on H100 (#125286)
Run a script to enumerate and get the best default block size for templated attention.

A100 -> no change, check numbers at #125139
H100
## torch.bfloat16

Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.103 |              |             |             |             |            |               |                |
| Max     |     1.322 |            8 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     0.829 |            1 |          16 |        1024 |        1024 |        128 | relative_bias | torch.bfloat16 |

```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.137 |              |             |             |             |            |               |                |
| Max     |     1.442 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.bfloat16 |
| Min     |     0.913 |            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |
```

## torch.float32
Before:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|---------------|
| Average |     2.269 |              |             |             |             |            |               |               |
| Max     |     3.740 |           16 |          16 |        1024 |        1024 |         64 | noop          | torch.float32 |
| Min     |     0.761 |            1 |          16 |         512 |         512 |        128 | relative_bias | torch.float32 |
```
After:
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype         |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|---------------|
| Average |     2.489 |              |             |             |             |            |             |               |
| Max     |     3.755 |           16 |          16 |        4096 |        4096 |         64 | noop        | torch.float32 |
| Min     |     1.609 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.float32 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125286
Approved by: https://github.com/Chillee
2024-05-01 07:34:08 +00:00
c511aed27f [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-05-01 06:53:49 +00:00
c3c4465f50 Add has_guarded_code to CompilationMetrics (#125279)
While studying some tlparse, I noticed that CompilationMetrics was reporting that there was no error for frames that have no nodes. I'm pretty sure we don't actually install a frame in this situation. has_guarded_code will tell us if that's the case, because it says if the GuardedCode object is None or not.

Actually, while working on this, I was wondering if we can ever trigger the "skip this frame entirely, do not trace it ever again" codepath, as best as I could tell, it's impossible for this to happen by the time we get to compilation metrics block.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125279
Approved by: https://github.com/yanboliang
2024-05-01 06:12:05 +00:00
cyy
081f41a920 Use BFloat16 in distributed quantization when supported by NCCL (#125113)
This PR enables BFloat16 in torch/csrc/distributed/c10d/quantization/quantization_gpu.cu .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125113
Approved by: https://github.com/kwen2501
2024-05-01 05:43:35 +00:00
14857e71c2 Export torch.jit.interface from torch.jit package (#125209)
Seems like this symbol was overlooked when other symbols were exported from `torch.jit`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125209
Approved by: https://github.com/ezyang
2024-05-01 05:38:05 +00:00
75a8e9ee77 [inductor] better cache clearing in fx graph cache tests (#125280)
Summary: There's a shortcoming in the FX graph cache tests in that they don't fully clear all inductor in-memory caches when testing the cache-hit path: We were previously accessing the FX graph cache correctly, but when loading the source object using the PyCodeCache.load_by_key_path() method, _that_ path was serving entries out of memory. To better mimic what happens during warm start (i.e., a new process), we should clear all in-memory caches.

Test Plan: updated the unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125280
Approved by: https://github.com/eellison
2024-05-01 04:47:46 +00:00
787afc5180 Add LR as tensor tests (#123750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750
Approved by: https://github.com/janeyx99
2024-05-01 04:46:49 +00:00
1c905f1be3 [EZ][BE] Don't import pathlib twice (#125260)
It was imported once as `import pathlib` and second time as `from pathlib import Path`

Stick to the 2nd flavor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125260
Approved by: https://github.com/kit1980
2024-05-01 04:08:16 +00:00
abaa717350 [FSDP2] Removed logic to save and remove pre-backward hook handles (#125269)
1. This PR removes the logic for saving and removing the pre-backward hook handles (which is registered via `register_multi_grad_hook(mode="any")`).
2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See https://github.com/pytorch/pytorch/pull/118118 for original motivation.)

For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule.

If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125269
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190, #125191
2024-05-01 03:51:30 +00:00
37c993546d [dynamo][guards] Bug fix for set_export_info (#125275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125275
Approved by: https://github.com/yanboliang
2024-05-01 03:46:26 +00:00
4d5f8070c4 add a decomposition for select_scatter (#124426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426
Approved by: https://github.com/peterbell10
2024-05-01 03:23:18 +00:00
e9ce23985f [TorchScript] attach target function to OSError when source can't be found (#125248)
Before, it would be hard to figure out which function/module in particular was causing the OSError. Now we'll try to print the function/module string.

Differential Revision: [D56768365](https://our.internmc.facebook.com/intern/diff/D56768365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125248
Approved by: https://github.com/eellison
2024-05-01 03:18:55 +00:00
8f31988088 [C10D] Document 'tag' limitation for nccl send/recv (#125278)
Existing documentation on isend/irecv also applies to send/recv. This PR
copies the doc/warning to send/recv ops as well.

Note: tag may be supplied, but will be ignored when used with nccl
backend.

Fixes #94819 #125079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125278
Approved by: https://github.com/kwen2501
2024-05-01 02:53:30 +00:00
74e8817311 [inductor] Minor fixes to various tests before enabling fx graph caching in OSS by default (#125258)
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
2024-05-01 02:34:01 +00:00
0506e95433 [dynamo] support inactive context managers across graph breaks (#125203)
Fix https://github.com/pytorch/pytorch/issues/124900.

When we reconstruct `ContextWrappingVariables`s, we only reconstruct the context class, not the object. Normally, contexts are active (via `with ctx:`) and we initialize the context object in the resume function. But for the case of inactive contexts (contexts declared ahead of time before the `with` block), we do not reconstruct them properly in the optimized bytecode or resume function. So this PR adds initialization for inactive contexts in the resume function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125203
Approved by: https://github.com/jansel
2024-05-01 01:49:09 +00:00
1b9d353e4f [Torch] Add more mm kernel choices (#125000)
Differential Revision: D56616836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125000
Approved by: https://github.com/htyu
2024-05-01 01:40:24 +00:00
a59dc14877 Keep node.meta when fusing subgraph (#125261)
Summary: When CapabilityBasedPartitioner creates the fused subgraph as the call_module node, it didn't populate the node.meta["val"] field.

Test Plan: OSS CI

Differential Revision: D56789259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125261
Approved by: https://github.com/zhxchen17
2024-05-01 01:38:28 +00:00
0ee5c14163 [PT2][Optimus] Read the patterns from the config instead of hard-code passes (#125136)
Summary: Due to the compatitbility issue, we hard coded the passes to do the pattern optimization. Here, we revisit the method since it has been a while for the changes into production packages. We instead read from the config to decide whether we do the specific pattern optimization, which makes followup pattern add easier.

Differential Revision: D56659934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125136
Approved by: https://github.com/jackiexu1992
2024-05-01 01:35:30 +00:00
25691558d9 Change templated_attention -> flex_attention (#125251)
# Summary

Change all the names

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125251
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2024-05-01 01:08:48 +00:00
a7023b89f8 Use torch._check for safety assert in _reshape_view_helper (#125187)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125187
Approved by: https://github.com/albanD
2024-05-01 00:40:31 +00:00
1bcbc9158f Add CUDA 12.4 workflows (#121684)
Reference: https://github.com/pytorch/pytorch/pull/98492

Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121684
Approved by: https://github.com/atalman
2024-04-30 23:03:24 +00:00
d6c713884a [dynamo, 3.12] xfail refleaking tests due to buggy getattr_static (#125062)
For tracking https://github.com/pytorch/pytorch/issues/124302 so that we can re-enable the test once 3.12 updates with the bug fix for https://github.com/python/cpython/issues/118013.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125062
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-04-30 22:40:47 +00:00
c12c85e919 Revert "[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729)" (#125246)
This reverts commit 62b5738a8bf325d79468b839b8412b87cb9951c1.

https://github.com/pytorch/pytorch/pull/119729/ regresses cudagraph dashboard. Moving the one-time per iteration loss from CPU to CUDA is somehow causing a lot of copies:

current (top) vs with revert (bottom)
![image](https://github.com/pytorch/pytorch/assets/9547562/62dfbf66-7edc-4a3c-ba7f-1ec057fba950)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125246
Approved by: https://github.com/eellison
2024-04-30 22:39:53 +00:00
9fec26e231 Fix typo under torch/_inductor directory (#119658)
This PR fixes typo in comments and msgs under `torch/_inductor` directory, and also changes the corresponding test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658
Approved by: https://github.com/colesbury
2024-04-30 22:28:56 +00:00
ca0f070065 Revert "Add registration API for torch.compile-eager (#121387)"
This reverts commit 61e937f3d6b904d6706594c1b3cfd7d0e56f9663.

Reverted https://github.com/pytorch/pytorch/pull/121387 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/121387#issuecomment-2087541956))
2024-04-30 22:13:04 +00:00
00dd4d55e3 Refactored _remove_auto_functionalization_from_graph_helper (#125180)
Summary:
Refactored the function to remove multiple list slicing and use unused variable.

Test Plan:
python test/run_test.py

Reviewers: @drisspg

Subscribers:

Tasks: [T187526123](https://www.internalfb.com/intern/tasks/?t=187526123) [T93492332](https://www.internalfb.com/intern/tasks/?t=93492332)

Tags: @pytorchbot merge -r viable/strict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125180
Approved by: https://github.com/drisspg
2024-04-30 21:44:07 +00:00
ea347fa6ce Revert "Fix & optimze open device registration test. (#124712)"
This reverts commit f03cf9d4dc8ebe85552f450678988cac4e959da3.

Reverted https://github.com/pytorch/pytorch/pull/124712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124712#issuecomment-2086971499))
2024-04-30 20:00:37 +00:00
c1a3fcfa47 [pipelining] Add util and debug facilities (#124875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124875
Approved by: https://github.com/H-Huang
ghstack dependencies: #124776
2024-04-30 19:41:41 +00:00
75fa54a9d1 Revert "Convert ForeachFuncInfo to dataclass (#125001)"
This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3.

Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm 9466335ae4 ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))
2024-04-30 19:05:53 +00:00
56e4cbc69d Fixes two build problems on ROCM 6.1 + Ubuntu 22.04 (#118216)
Fixes two build problems on ROCM 6.1 + Ubuntu 22.04

### Inconsistency value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions

Current `CMAKE_PREFIX_PATH` points to the base environment of the conda (commonly `/opt/conda`). However the conda environment used in the CI should be `/opt/conda/envs/py_<VRESION>`, which is supplied by `$CONDA_PREFIX`.

This divergence may cause libstdc++ version conflicts because the base conda environment may ship a different libstdc++ than the `pv_<VERSION>`, and/or the system default environment. One notable issue is on our internal CI system this script failed to build AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between HIP compiler and conda base environment.

This PR fixes this and make sure the CI script follows the official build instruction.

### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release`

The code to parse /etc/os-release is incorrect and the distribution info was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. `libtinfo` will not be linked into the binary due to this flaw. Thus, cpp unit tests failed to build because of missing symbols from `libtinfo`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman
2024-04-30 18:58:48 +00:00
90258e8369 forward fix preferred blas backend and windows CI (#125080)
PR #122106 broke windows tests. The feature should have been disabled for Windows but was not disabled correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125080
Approved by: https://github.com/clee2000
2024-04-30 18:38:31 +00:00
04a241947a [dtensor] delete the old unused mesh_alltoall (#124879)
as titled, as we have a dedicated comm op, this is not needed anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-30 18:30:34 +00:00
00df0d3e94 [dtensor] implement shard dim change with alltoall (#124872)
as titled, we implement a dedicated communication op to allow efficient
sharding dimension change using alltoall, to replace our previous
allgather + local chunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-30 18:30:34 +00:00
02e7800b3f [Torch][Timer] Skip expired timer logging for empty expired timers (#125039)
Summary: same as title

Test Plan: unit tests

Differential Revision: D56636566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125039
Approved by: https://github.com/kurman
2024-04-30 18:28:49 +00:00
3946fa1c12 Fix bug in get_update_constraint (#125194)
Summary: Title

Test Plan: CI

Differential Revision: D56726321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125194
Approved by: https://github.com/pianpwk
2024-04-30 18:21:29 +00:00
07958c538c Setup initial testing harness and cache key generation for AOTAutograd Cache (#124642)
This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test. From here I will:

- Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here.
- Add serialization for FX graphs
- Save these FX graphs to disk in the cache
- Support graphs with more complicated ops like higher order ops and specialized nn modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642
Approved by: https://github.com/aorenste
2024-04-30 18:17:38 +00:00
8242fb62a7 [quant][pt2e] Fix conv-bn weight + bias per channel QAT (#125208)
Summary: This commit fixes the pattern matching for conv-bn
during QAT fusion where both weight and bias are quantized per
channel. Previously this failed because weights and biases used
the same example kwargs for their scales and zero points,
causing these qparams to be tied during pattern matching.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias

Reviewers: jerryzh168, angelayi

Subscribers: jerryzh168, angelayi, supriyar

Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208
Approved by: https://github.com/angelayi
2024-04-30 18:12:25 +00:00
05be0fb62d [minimizer] Add exclusion function to minimizer base (#124504)
Summary:
Add exclusion list to minimizer:
1. some operations cannot be lowered when constructing subgraphs; this usually happens when they are isolated from operation group
2. exclude them in search strategies for automation

Reviewed By: jimone1

Differential Revision: D56327289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124504
Approved by: https://github.com/jfix71
2024-04-30 18:02:46 +00:00
80046c315b Add templated attention BLOCK_M & BLOCK_N default size for different head_dim (#125139)
Run different head_dims [64, 128], which are the most popular ones across major GPT models.
Enumerate different ```BLOCK_M``` and ```BLOCK_N``` candidates [16, 32, 64, 128], and get the best config as default one.

## Before
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.704 |              |             |             |             |            |             |                |
| Max     |     0.953 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.482 |            1 |          16 |        4096 |        4096 |        128 | causal_mask | torch.bfloat16 |
```
## After
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.823 |              |             |             |             |            |             |                |
| Max     |     0.926 |            1 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     0.723 |            1 |          16 |         512 |         512 |        128 | causal_mask | torch.bfloat16 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125139
Approved by: https://github.com/Chillee
2024-04-30 17:40:44 +00:00
cyy
04c6424fbf Remove caffe2 image and video (#125045)
This PR tries to decompose https://github.com/pytorch/pytorch/pull/122527 into a smaller one. Caffe2 image and video folders are removed along with the related CMake code.
To be noted, this was inspired and is co-dev with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125045
Approved by: https://github.com/eqy, https://github.com/albanD
2024-04-30 17:31:57 +00:00
a03b9a2189 fix: typo (#125226)
Fixes spelling error: spacial is an incorrect spelling of spatial

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125226
Approved by: https://github.com/Skylion007
2024-04-30 16:57:39 +00:00
d699ade0cb [dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681)
Differential Revision: [D56723769](https://our.internmc.facebook.com/intern/diff/D56723769)
Co-authored-by: Sam Larsen <slarsen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681
Approved by: https://github.com/masnesral
ghstack dependencies: #124592
2024-04-30 16:54:16 +00:00
254128c16e [inductor] Remove usage of device_interface from _inductor.runtime (#124592)
Differential Revision: [D56723770](https://our.internmc.facebook.com/intern/diff/D56723770)
Co-authored-by: Sam Larsen <slarsen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592
Approved by: https://github.com/masnesral
2024-04-30 16:54:16 +00:00
5f4c6d9b49 Upgrade nightly wheels to rocm6.1 (#124811)
Follow-up to https://github.com/pytorch/builder/pull/1789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124811
Approved by: https://github.com/malfet
2024-04-30 16:30:19 +00:00
9466335ae4 Convert ForeachFuncInfo to dataclass (#125001)
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99
2024-04-30 16:19:42 +00:00
ecc2e034f7 Fakify script object inputs and attributes for non-strict export (#124239)
This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.

The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.

Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519
2024-04-30 15:57:25 +00:00
ab80a59677 CI: add opt-in aarch64 linux workflow (#121284)
Triggered by `ciflow/linux-aarch64` and runs only `test_modules`, `test_mkldnn`, `test_mkldnn_fusion` and `test_openmp` as test for now.
TODOS:
 - Enable sscache for fast CI
 - Extend to a more reasonable test coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-30 15:10:56 +00:00
b7d67e476d upload pt2 cprofile stats to manifold (#125162)
Summary:
https://fb.workplace.com/groups/257735836456307/permalink/657458576484029/

upload cprofile to manifold

D56696397 has a script to convert profiler stats to dot graphs (see its test plan)

Test Plan:
non-MAST
`TORCH_COMPILE_CPROFILE=1 buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor --profile --profile-export-chrome-trace`

https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/test/20240428_234002_7562397568

MAST
`buck2 run mode/opt aps_models/ads/icvr:icvr_launcher -- mode=mast_ctr_cvr_cmf_rep launcher.fbl_entitlement=ai_infra_training_rnd_tc features=ctr_cvr_conso_cmf_pipeline_features_455876776_3teach model=ctr_cvr_cmf_when_rep_config_msmn_3teach model_name=ctr_cvr_when model.when_arch.use_extended_residual_contexts=True optimizers.dense_default.lr_schedule.0.max_iters=20000 training.planner.storage_reservation_policy=FixedPercentage training.planner.storage_reservation_percentage=0.72 data_loader.dataset.batch_size=2048 trainer.garbage_collection.garbage_collection_interval=100 model.when_arch.layer_norm_init_weight=0.3 optimizers.dense_default.lr_schedule.0.value=0.001 model.when_arch.customized_mlp_init_scale=0.3 launcher.num_workers=128 launcher.max_retries=10 launcher.data_project=oncall_ads_model_platform launcher.hardware=ZIONEX_80G data_loader.dataset.table_ds="[2024-01-01]" launcher.job_name=test_inductor_logging`

https://www.internalfb.com/manifold/explorer/pyper_traces/tree/compilation_cprofile/aps-test_inductor_logging-745febb51a

Generating dotty files from D56696397
```
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile1.profile ...
P1225733598: https://www.internalfb.com/intern/paste/P1225733598/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733598
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile10.profile ...
P1225733629: https://www.internalfb.com/intern/paste/P1225733629/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733629
Generating dot file from cprofile stats /home/daohang/aps-test_inductor_logging-745febb51a/0/0/_compile0.profile ...
P1225733649: https://www.internalfb.com/intern/paste/P1225733649/
Dotty: https://www.internalfb.com/intern/graphviz/?paste=1225733649
```

Differential Revision: D56679561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125162
Approved by: https://github.com/anijain2305
2024-04-30 15:05:01 +00:00
2480e8b8a1 Add MAP_SHARED option for torch.load(mmap=True) (#124889)
Fixes #124528

Going over the options for our MapAllocator and what they do, I don't think any other of them need to be piped up to `torch.load`

4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)

~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~

Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
2024-04-30 15:02:19 +00:00
761a7b84ba [Dynamo] Fix alias issue with respect to wrapped numbers (#124731) (#124774)
This PR fixes an issue presented when calling `aten.alias(int)` raises a TypeError.

```python
import torch
import torch.autograd.forward_ad as fwAD

def f(x):
    return 4312491 * x

device = "cpu"

with torch._subclasses.fake_tensor.FakeTensorMode():
    with fwAD.dual_level():
        x = torch.randn(3, device=device)
        y = torch.ones_like(x)
        dual = fwAD.make_dual(x, y)
        f(dual)
```

The test case above illustrates this bug.
1) `4312491` turns into a tensor that is a wrapped number
2) Forward mode AD calls `aten::alias` internally
3) The wrapped number (`4312491`) becomes a python integer
4) `aten.alias(int)` raises a `TypeError`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124774
Approved by: https://github.com/albanD, https://github.com/zou3519
2024-04-30 14:11:46 +00:00
9aed5dcfe6 Clarify wording in docstring for CosineAnnealingWarmRestarts within lr_scheduler.py (#125161)
- Clarifies wording in the docstring for `CosineAnnealingWarmRestarts` within `lr_scheduler.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125161
Approved by: https://github.com/janeyx99
2024-04-30 14:01:22 +00:00
e3db465029 Re-enable nightly testing for linux and macos binaries (#123390)
Related to: https://github.com/pytorch/pytorch/issues/123225

The skip tests logic lives here:
https://github.com/pytorch/builder/blob/main/run_tests.sh#L19

Linux builds are using check_binary:
https://github.com/pytorch/pytorch/actions/runs/8627625694/job/23649245546#step:16:339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123390
Approved by: https://github.com/ZainRizvi
2024-04-30 12:53:40 +00:00
07d3af8e6a Added ARC test jobs to all build jobs in the unstable bucket (#125142)
Added ARC test jobs to all build jobs in the unstable bucket
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125142
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2024-04-30 09:32:22 +00:00
dc514df2af [inductor] add triton code to SchedulerNode.debug_str (#125091)
Here is an example print: https://gist.github.com/shunting314/75c161368a833a535bd0d240b8099d7e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125091
Approved by: https://github.com/jansel
ghstack dependencies: #125090
2024-04-30 08:27:53 +00:00
a587a93f4c [inductor][easy] add buffer layout to SchedulerNode.debug_str (#125090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125090
Approved by: https://github.com/jansel
2024-04-30 08:27:53 +00:00
e0d2c24de1 Fix device type issue in _get_device_handle (#124390)
Fix #124327

`device_type`, the first arg of [init_device_mesh()](a0466061e1/torch/distributed/device_mesh.py (L503)),  does not support types with indexes, such as `cuda:0`.
If `cuda:0` is used as a parameter, `_get_device_handle()` will not correctly return `torch.cuda`.
So the exception should be thrown before creating DeviceMesh object.

> See https://github.com/pytorch/pytorch/issues/124327#issuecomment-2062551161,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124390
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-04-30 06:59:56 +00:00
5e5f890273 [dynamo][source] Remove inspect getattr_static from AttrSource (#125200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125200
Approved by: https://github.com/jansel
2024-04-30 06:44:25 +00:00
8320b770fd [dynamo] use lazy disable dynamo for manual seed (#125196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125196
Approved by: https://github.com/fegin, https://github.com/yanboliang
2024-04-30 06:04:22 +00:00
e7846447e0 dynamic shapes builder API (#124898)
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.

This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.

With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since  such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs.

Differential Revision: D56551992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
2024-04-30 03:59:49 +00:00
31801918e9 Add pooling support for 3d channels last (#116305)
Part of a multi-PR work to improve #59168

Meant to complete
Write native kernels for AvgPool3d
Write native kernels for MaxPool3d
Write native kernels for AdaptiveAvgPool3d
Write native kernels for AdaptiveMaxPool3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116305
Approved by: https://github.com/ezyang
2024-04-30 03:51:49 +00:00
16e8431963 Fix hybrid sparse COO tensor conversion to meta tensor (#125120)
As in the title.

Addresses a bug reported in https://github.com/pytorch/pytorch/pull/117907#issuecomment-2080035379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125120
Approved by: https://github.com/ezyang, https://github.com/amjames
2024-04-30 03:43:42 +00:00
74b7c56517 [Autotune] Use half the number of warps for reduction tuning on AMD. (#125084)
I was seeing for a reduction kernel and a given block size, on AMDGPU, the vectorization bandwidth (16-byte) for a thread was not fully leveraged while it was not a problem for NVGPU. It appeared that each thread got fewer data to process as a whole row were processed by more threads, and the number of elements each thread got was not enough to saturate full vectorization. On AMDGPU, a warp has 64 lanes compared to 32 on the NV side. Therefore I'm tuning down the default number of warps (8 for NV) for AMD. I'm seeing 10% speed up for an internal benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125084
Approved by: https://github.com/shunting314
2024-04-30 02:38:34 +00:00
0969f01d73 [FSDP2] Accumulated in reduce_dtype if not syncing grads (#125191)
For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch.

Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead.

One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125191
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190
2024-04-30 02:19:13 +00:00
631d2b87f1 [FSDP2] Fixed fp32 param dtype/bf16 reduce dtype test (#125190)
The unit test for fp32 `param_dtype` and bf16 `reduce_dtype` was disabled. This PR debugs the issue and identifies the root cause as numeric differences between NCCL bf16 all-reduce vs. bf16 reduce-scatter. We address this by having the baseline use reduce-scatter -> all-gather to implement all-reduce.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125190
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-04-30 02:15:33 +00:00
2369ee49cc Update torch-xpu-ops pin (ATen XPU implementation) (#125011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125011
Approved by: https://github.com/EikanWang
2024-04-30 01:18:19 +00:00
724c7491d0 Revert " [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)"
This reverts commit b3fd94d15ef49c99ffa32a8226d1f00b0cc26f68.

Reverted https://github.com/pytorch/pytorch/pull/124987 on behalf of https://github.com/ezyang due to broke downstream extensions ([comment](https://github.com/pytorch/pytorch/pull/124987#issuecomment-2083956511))
2024-04-30 00:37:53 +00:00
e7631d6eae Revert "CI: add aarch64 linux workflow (#121284)"
This reverts commit 32cf04cb7f7aa14aff4d1cf40517d5de797550e7.

Reverted https://github.com/pytorch/pytorch/pull/121284 on behalf of https://github.com/malfet due to Test only changes has not been reverted ([comment](https://github.com/pytorch/pytorch/pull/121284#issuecomment-2083925890))
2024-04-30 00:24:11 +00:00
744f341aa4 Fix ref leak in dtype.to_complex()/to_real() (#125154)
By using `Py_NewRef`

Also, wrap `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS`

Add regression test for the above issues, by calling to_complex for integral dtypes, that raises an exception and by preserving reference count to the same to_complex/to_real call to detect if leak is happeneing.

Replace
```cpp
auto dtype = (PyObject*)torch::getTHPDtype(current_dtype);
Py_INCREF(dtype);
return dtype;
```
with a more compact/streamlined equivalent
```cpp
return Py_NewRef(torch::getTHPDtype(current_dtype));
```

Fixes https://github.com/pytorch/pytorch/issues/124868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125154
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-04-29 23:59:27 +00:00
4d717cd7c3 [TD] Enable td on cpu windows (#125049)
yolo

Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet

I am avoid everything other than pull + trunk workflows, so not doing this on windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 23:39:54 +00:00
8ee6105f84 Fix edge case in cudagraph pool detection (#124981)
When we do cudagraph warmup, we record which outputs are in the cudagraph pool, so subsequently when we invoke a cudagraph and need to reclaim its memory we can free the prior run's outputs and make them error on access.

In warmup, we detect this by ignoring outputs which are an alias of an input that is not a prior output. We did this by checking data pointer. In very rare situations, a data pointer of a non cudagraph input might get reallocated to a cudagraph pool and causes us to ignore it.

This was happening with  gpt-fast error with gemma 2 when coordinate_descent_tuning was set to False.

This updates so that we check aliasing with non-cudagraph inputs by looking at storage pointer..

Unrelated: saw very weird behavior where an output had the same data pointer as a supposedly live input but not the same cdata 🤔  I would think that is not possible.

```
out[0]._cdata in  [ref()._cdata for ab in non_cudagraph_inps_storage_refs] # False
out[0].data_ptr() in  [ref().data_ptr() for ab in non_cudagraph_inps_storage_refs] # True
```

Differential Revision: [D56607721](https://our.internmc.facebook.com/intern/diff/D56607721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124981
Approved by: https://github.com/ezyang
2024-04-29 23:37:34 +00:00
e1e6ef753b [dtensor] use str for reduce_op (#125172)
This PR use str for reduce_op directly instead of the c10d enum. Since
our functional collective already uses str, there's no reason that we
need the c10d enum anymore as that requires a conversion

Also the str hash + eq performance is actually significantly faster than
the c10d type, so this would somewhat improves the CPU overhead too

Some local cpu benchmarks on `1000000` hash operations:

```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
2024-04-29 23:30:24 +00:00
ccaf03fd89 Fix: nn.Parameter return type identified as Tensor instead of nn.Parameter (#125106)
Fixes #125105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125106
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-29 23:25:23 +00:00
26f8d96cab Fix typo in compile docstring regarding default cache_size_limit (#125145)
Docstring of `torch.compile` specifies that default `torch._dynamo.config.cache_size_limit` equals to `64`, while the value is `8` in the corresponding py file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125145
Approved by: https://github.com/kit1980
2024-04-29 22:47:43 +00:00
8c219251c5 Add backwards support to FlexAttention (#123902)
# Summary
This is part one of adding backwards support to FlexAttention.

This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager.

Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR.

#### Structure
We pass both the forward and backward graph to the backwardsHOP since these are both needed to be inlined into the calculation for backwards:
- the forward graph is needed in order to re-compute the scores
- the joint graph is needed in order to construct the correct gradients  post softmax_grad calc

### Attatched AOT Graph
https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902
Approved by: https://github.com/Chillee
2024-04-29 22:34:22 +00:00
720e5f306d Update CODEOWNERS - Dataloader (#125181)
Fixes #124473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125181
Approved by: https://github.com/gokulavasan, https://github.com/albanD
2024-04-29 21:37:18 +00:00
faee0e5ee8 [ez][CI] Move test_linalg and test_sparse_csr off CI_SERIAL_LIST (#125068)
* https://github.com/pytorch/pytorch/pull/124649 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125068
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-29 21:22:35 +00:00
946e202c07 [export] Restore user input names to unlifted graph modules (#124765)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/122842

Currently, calling ep.module() on an ExportedProgram leads to a GraphModule with a default forward signature (e.g. arg_0, arg_1, ...). This leads to original placeholder names disappearing for retracing/re-exporting.

Fixing this issue by creating a forward_arg_names field (will take renaming suggestions for this), that stores the positional & keyword arg names that are used. These names aren't present in the call_spec currently stored, and requires a major version bump for the ExportedProgram schema.

Test Plan: Tests exist for export, but names are now changed from generic (e.g. arg_0, arg_1) to follow user inputs (e.g. x, y)

Differential Revision: D56484994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124765
Approved by: https://github.com/zhxchen17
2024-04-29 20:58:17 +00:00
f1d1e3246f Revert "[dtensor] implement shard dim change with alltoall (#124872)"
This reverts commit 6b79469d2437531fa506b48d42488be512a87f4d.

Reverted https://github.com/pytorch/pytorch/pull/124872 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 f7f018a0ed.  Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))
2024-04-29 20:26:16 +00:00
3bd67dab32 Revert "[dtensor] delete the old unused mesh_alltoall (#124879)"
This reverts commit f7f018a0ed442f92eb5270150ced7b6117773368.

Reverted https://github.com/pytorch/pytorch/pull/124879 on behalf of https://github.com/clee2000 due to broke distributed/tensor/parallel/test_tp_examples.py::DistTensorParallelExampleTest::test_transformer_training_is_seq_parallel_True https://github.com/pytorch/pytorch/actions/runs/8882762411/job/24389191482 f7f018a0ed.  Bad TD ([comment](https://github.com/pytorch/pytorch/pull/124872#issuecomment-2083599445))
2024-04-29 20:26:15 +00:00
3d1dd79b80 make sure to stopTrace() on exception (#125131)
If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation.

I'm hoping this fixes the flakyness in #124253, #124220, #82720, #119346, #119364, #119490, #119526, #119537 (and the currently closed #82864).

Before:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running
==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ====================================================================================================
```
After:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969...
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please...
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s =====================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125131
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-04-29 19:07:37 +00:00
a434d1487b Fix EtcdServer leak in etcd_server_test.py file (#125121)
As stated in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125121
Approved by: https://github.com/Skylion007
2024-04-29 18:59:05 +00:00
fab5bd5359 [checkpoint] Improve error message when use_reentrant=True is used with .grad() (#125155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125155
Approved by: https://github.com/albanD
2024-04-29 18:57:35 +00:00
f03cf9d4dc Fix & optimze open device registration test. (#124712)
Fixes #100152

1. Fix the wrong tests about lazy init for PrivateUse1 named foo
2. Fix wrong backend meta registry mechanism when compiling with clang++( compiling with g++ work well)(introduced by static variable in inline function)
3. Refactor the tests and make it more flexible
4. Disable the two tests temporarily
    - test_open_device_storage_pin_memory
    - test_compile_autograd_function_aliasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124712
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-29 18:55:38 +00:00
32cf04cb7f CI: add aarch64 linux workflow (#121284)
aarch64 linux workflow is triggered for ciflow/aarch64 tags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121284
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-29 18:25:40 +00:00
ae13c7e593 Revert "[Meta Tensor] fix meta inplace set storage (#123880)"
This reverts commit cccae9355191a807040fb40a65178c4d7fe3f084.

Reverted https://github.com/pytorch/pytorch/pull/123880 on behalf of https://github.com/izaitsevfb due to breaks cpu_inductor_torchbench (detectron2_fasterrcnn) ([comment](https://github.com/pytorch/pytorch/pull/123880#issuecomment-2083366385))
2024-04-29 18:19:42 +00:00
96cc73dc13 [oss][torch.package] fix multiple error messages within PackageExporter (#124943)
Summary:
fixes two issues:
- when exporting with debug=True, the list of error-causing modules and a dependency path to them is not printed correctly, there's a missing newline after the path, meaning the name of the module for the next error is on the wrong line, which makes the output a confusing mess to read
- when a pickled object references more than one mocked module directly, the error message incorrectly repeats the same information, claiming the referenced attribute is present in several different libraries, because the if condition references the last seen module name while walking the pickle ops, not the module name from the enclosing block `for module_name in all_dependencies:`. this is confusing because one error will print as O(all_dependencies) errors, all with different module names but the same attribute name

Differential Revision: D56578035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124943
Approved by: https://github.com/JonAmazon, https://github.com/houseroad
2024-04-29 18:11:28 +00:00
f7f018a0ed [dtensor] delete the old unused mesh_alltoall (#124879)
as titled, as we have a dedicated comm op, this is not needed anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124879
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #124871, #124872
2024-04-29 17:22:30 +00:00
6b79469d24 [dtensor] implement shard dim change with alltoall (#124872)
as titled, we implement a dedicated communication op to allow efficient
sharding dimension change using alltoall, to replace our previous
allgather + local chunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124872
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #124871
2024-04-29 17:22:30 +00:00
8d46ab4104 [dtensor] move pad/unpad_tensor to separate utils (#124871)
as titled, 1. pad/unpad is a general util not specific to the Shard
placement, 2. for the propose of the next PR, move these two out of Shard
placement itself, and give additional pad_dim argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/XilunWu
2024-04-29 17:22:25 +00:00
935a946241 [RFC][FSDP2] Renamed FSDP to FSDPModule (#124955)
This PR renames the `FSDP` class to `FSDPModule`. This is a BC breaking change. The rationale is that `FSDPModule` is more descriptive since `fully_shard` is a module-level API (applied to a `module` arg), so the `FSDP` class will always correspond to a module.

Also, users commonly import `FullyShardedDataParallel` as `FSDP`, so this can help avoid some name conflict in some cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124955
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #124651, #124741, #124767, #124768, #124780, #124787
2024-04-29 16:33:18 +00:00
da44d2f7fb split out flop counting its own method (#125061)
Summary: Modularizing code for reuse by splitting __torch_dispatch__ to move flop counting to its own method.

Test Plan: unit tests

Differential Revision: D56644523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125061
Approved by: https://github.com/842974287
2024-04-29 14:13:44 +00:00
e5e623af4b Codegen runtime asserts in Inductor (#124874)
This completely subsumes https://github.com/pytorch/pytorch/pull/120816

This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.

Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!

So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.

There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.

**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar`  entirely; we assume that ShapeEnv will be good enough to capture all of these.

**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification

**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.

**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py

**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node acessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as argments). There are some new tests exercising this.

TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
2024-04-29 10:19:29 +00:00
e498e28b2f Remove API that allows for extra deferred runtime asserts during lowering (#124864)
I want to generate runtime assert nodes during lowering, which means
that I need a finalized list of asserts by the time I start lowering.
This means this runtime assert introduced in
https://github.com/pytorch/pytorch/pull/113839 must go.  Fortunately,
this runtime assert was never exercisable, apparently, and the test
still "passes" without it.  I replace it with a compile time test.  We
can revisit if this assert fails in practice.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124864
Approved by: https://github.com/jansel
2024-04-29 10:19:29 +00:00
303880e16b Update gen.py aoti_fm install dir (#125087)
Summary: make it consistent with all the other install dir

Test Plan: Sandcastle

Differential Revision: D56660301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125087
Approved by: https://github.com/frank-wei
2024-04-29 08:25:16 +00:00
cyy
5585138db9 Remove caffe2 contrib and experiments (#125038)
This PR tries to decompose #122527 into a smaller one.
To be noted, this was inspired and is co-dev with @r-barnes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125038
Approved by: https://github.com/malfet
2024-04-29 06:27:13 +00:00
555f1aeb02 Fix module buffer mutation (#124586)
Fixes #124583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-04-29 06:05:12 +00:00
06b845dedc Make metadata serialization more strict (#124411)
Summary: When I was debugging an issue, this silent error makes the debugging harder. It is better to error earlier with more descriptive error message.

Test Plan: None

Differential Revision: D56312433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124411
Approved by: https://github.com/zhxchen17
2024-04-29 02:11:40 +00:00
cc06c00a56 Don't run auto grad safe mode when predispatch is on (#125066)
Summary: Title

Test Plan: CI

Differential Revision: D56646678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125066
Approved by: https://github.com/zhxchen17
2024-04-29 01:53:23 +00:00
e3b9b71684 [BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126)
Don't bother logging exception obj explicitly with logger, it's captured anyway and would generate verbose outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126
Approved by: https://github.com/ezyang
2024-04-28 21:44:33 +00:00
3e1fb96964 [BE]: RUF018 - ban assignment in assert (#125125)
Ban assignment inside of assert. Python code should ideally not break with assertions disabled. Adds a ruff lint rule to enforce this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125125
Approved by: https://github.com/ezyang
2024-04-28 21:41:36 +00:00
a05b2ae302 Enable UFMT on test/test_dataloader.py (#124710)
Part of: #123062

Ran lintrunner on:

- test/test_custom_op_testing.py (already deleted)
- test/test_dataloader.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124710
Approved by: https://github.com/soulitzer
2024-04-28 21:21:51 +00:00
hun
518ab48e85 Enable UFMT on test/test_functionalization.py (#123926)
Part of  #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123926
Approved by: https://github.com/ezyang, https://github.com/statelesshz
2024-04-28 17:02:34 +00:00
cccae93551 [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-04-28 17:01:12 +00:00
6761b49551 Ensure autocast device_type is a string + Unit test (#125014)
Reviving #124873 (already approved) to resolve CLA issues

Fixes #124738

(Marked as draft until I get local unit tests to run)

Edit: Tests passing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125014
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
2024-04-28 16:27:30 +00:00
1a0b247762 [dynamo] Bug fix for LOAD_GLOBAL and STORE_GLOBAL (#125002)
Earlier globals of inlined functions from other files were not handled correctly. We were not tracking mutations on them. They were colliding with the same global name in the parent function etc. This PR overrides the LOAD/STORE_GLOBAL for inline tx and tracks mutation on them separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125002
Approved by: https://github.com/jansel
ghstack dependencies: #125097, #125107
2024-04-28 15:24:17 +00:00
0f139b04b3 [dynamo] Fix test (#125107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125107
Approved by: https://github.com/jansel
ghstack dependencies: #125097
2024-04-28 15:24:17 +00:00
49ca2b3429 [BE]: Apply RUF025 perf fixups (#125104)
Uses `dict.fromkeys()` for more efficient dict construction. Automatically generated by RUF025 (prev).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125104
Approved by: https://github.com/ezyang
2024-04-28 15:09:21 +00:00
94b328ee45 add likely/unlikely macro for unsupport c++20 compiler. (#124997)
# Issue:
Intel validation team found some low version gcc which not support c++20 will occur below issue:
```cmd
[2024-04-13T08:03:25.142Z] g++ /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/TH -I/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/THC -I/root/anaconda3/envs/pytorch/include/python3.8 -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -L/root/anaconda3/envs/pytorch/lib -L/root/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -ltorch_python -lc10 -mavx2 -mfma -DCPU_CAPABILITY_AVX2 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D C10_USING_CUSTOM_GENERATED_MACROS -o /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.so
[2024-04-13T08:03:25.142Z]
[2024-04-13T08:03:25.142Z] Output:
[2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp: In function ‘T parse_arg(PyObject*, size_t) [with T = long int; PyObject = _object; size_t = long unsigned int]’:
[2024-04-13T08:03:25.142Z] /tmp/torchinductor_root/vd/cvdytwwwlhi63ofh3pwzqfpjga4w4xe7bjfdoavpblbo5khzf3b2.cpp:117:10: error: expected identifier before ‘[’ token
[2024-04-13T08:03:25.142Z] [[unlikely]] throw std::runtime_error("expected int arg");
[2024-04-13T08:03:25.142Z] ^
```

The season is `unlikely` need c++20 attribute, ref: https://en.cppreference.com/w/cpp/language/attributes/likely

# Solution:
Add MACRO to enable non-c++20 attribute GNU compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124997
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 07:03:12 +00:00
42a192db0f Fix Conv BN folding with deadcode (#124808)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/124286
The TorchBenchmark includes a method called `run_n_iterations` which runs model multiple times.
43f4e71daa/benchmarks/dynamo/common.py (L2272-L2276)

https://github.com/pytorch/pytorch/pull/123399 enables tracing into a `UserDefinedObjectVariable` that's an instance method.    It will trace the model into FX graph multiple times within `run_n_iterations`. Then, in the Inductor, `Conv-BN folding` at the module level will fuse the same Conv-BN module multiple times in this case, which leads to accuracy failures. This PR addresses the issue by ensuring that each Conv-BN module is fused only once.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_functional_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_bn_with_multi_bn_share_conv
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_functional_bn_with_multi_bn_share_conv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124808
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-04-28 06:29:40 +00:00
c1e0dea023 Delete unused param 'OP' in KERNEL_PRIVATEUSEONE (#125008)
Parameter 'OP' is unused but occupies  a position that will cause the length of \_\_VA_ARGS\__  less than expected.
Missed this diff in https://github.com/pytorch/pytorch/pull/124050.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125008
Approved by: https://github.com/FFFrog, https://github.com/leslie-fang-intel
2024-04-28 06:17:16 +00:00
5f7c4181b5 Correcting valid device name of privateuse1 (#125018)
"privateuseone" is an invalid string for privateuse1 backend, the correct one should be returned from _get_privateuse1_backend_name().
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125018
Approved by: https://github.com/aaronenyeshi
2024-04-28 06:04:34 +00:00
c5b1a4c269 [inductor] share more cse cache during swap buffer (#124921)
`swap_buffer` will make the `cse_cache` cannot be shared inside/outside of the lambda function scope.
For example,

```
auto tmp8 = -std::numeric_limits<float>::infinity();
auto tmp9 = [&]
{
    auto tmp12 = -std::numeric_limits<float>::infinity();
    return tmp12;
}
```
`tmp12` should not be created since it is same with `tmp8`.

We make the `cse_cache` as a read only cache inside the scope (because it is unsafe to expose cache inside the scope,the outside scope cannot use it.)

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_AllenaiLongformerBase_repro_cpu
```
the `static_cast<int>(256)` will only occur once after this PR since the inside scope can share the cse buffer outside the scope.

Before this PR,
```
cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(128)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = c10::convert<int>(x1);
                            auto tmp1 = static_cast<int>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int>(x3);
                                auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1);
                                auto tmp6 = static_cast<int>(257);
                                auto tmp7 = at::vec::Vectorized<int>(tmp6);
                                auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7);
                                auto tmp10 = at::vec::VecMask<float,1>::from(tmp2);
                                auto tmp11 = tmp8 & tmp10;
                                auto tmp9 = [&]
                                {
                                    auto tmp12 = -std::numeric_limits<float>::infinity();
                                    return tmp12;
                                }
                                ;
                                auto tmp13 =
                                [&]
                                {
                                    if (tmp11.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                                auto tmp15 = static_cast<int>(3);
                                auto tmp16 = tmp14 < tmp15;
                                auto tmp18 = tmp16 & tmp2;
                                auto tmp17 = [&]
                                {
                                    auto tmp19 = c10::convert<int>(x3);
                                    auto tmp20 = at::vec::Vectorized<int>::arange(tmp19, 1);
                                    auto tmp21 = static_cast<int>(256);
                                    auto tmp22 = at::vec::Vectorized<int>(tmp21);
                                    auto tmp23 = at::vec::VecMask<int,1>(tmp20 >= tmp22);
                                    auto tmp25 = at::vec::VecMask<float,1>::from(tmp18);
                                    auto tmp26 = tmp23 & tmp25;
                                    auto tmp24 = [&]
                                    {
                                        auto tmp27 = tmp26.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                        return tmp27;
                                    }
                                    ;
                                    auto tmp28 =
                                    [&]
                                    {
                                        if (tmp26.all_zero())
                                        {
                                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                        }
                                        else
                                        {
                                            return decltype(tmp24())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp24(), tmp26.template cast<float,1>());
                                        }
                                    }
                                    ()
                                    ;
                                    auto tmp29 = static_cast<float>(0.0);
                                    auto tmp30 = at::vec::Vectorized<float>(tmp29);
                                    auto tmp31 = decltype(tmp28)::blendv(tmp30, tmp28, tmp23.template cast<float,1>());
                                    return tmp31;
                                }
                                ;
                                auto tmp32 = tmp16 ? tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                                auto tmp33 = static_cast<float>(0.0);
                                auto tmp34 = at::vec::VecMask<float,1>::from(tmp16);
                                auto tmp35 = at::vec::Vectorized<float>(tmp33);
                                auto tmp36 = decltype(tmp32)::blendv(tmp35, tmp32, tmp34.template cast<float,1>());
                                auto tmp37 = decltype(tmp13)::blendv(tmp36, tmp13, tmp8.template cast<float,1>());
                                return tmp37;
                            }
                            ;
                            auto tmp38 = tmp2 ? tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp39 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                            auto tmp40 = static_cast<int>(3);
                            auto tmp41 = tmp39 < tmp40;
                            auto tmp42 = [&]
                            {
                                auto tmp43 = c10::convert<int>(x3);
                                auto tmp44 = at::vec::Vectorized<int>::arange(tmp43, 1);
                                auto tmp45 = static_cast<int>(256);
                                auto tmp46 = at::vec::Vectorized<int>(tmp45);
                                auto tmp47 = at::vec::VecMask<int,1>(tmp44 >= tmp46);
                                auto tmp49 = at::vec::VecMask<float,1>::from(tmp41);
                                auto tmp50 = tmp47 & tmp49;
                                auto tmp48 = [&]
                                {
                                    auto tmp51 = tmp50.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                    return tmp51;
                                }
                                ;
                                auto tmp52 =
                                [&]
                                {
                                    if (tmp50.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(tmp48())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp48(), tmp50.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp53 = static_cast<float>(0.0);
                                auto tmp54 = at::vec::Vectorized<float>(tmp53);
                                auto tmp55 = decltype(tmp52)::blendv(tmp54, tmp52, tmp47.template cast<float,1>());
                                return tmp55;
                            }
                            ;
                            auto tmp56 = tmp41 ? tmp42() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp57 = static_cast<float>(0.0);
                            auto tmp58 = at::vec::VecMask<float,1>::from(tmp41);
                            auto tmp59 = at::vec::Vectorized<float>(tmp57);
                            auto tmp60 = decltype(tmp56)::blendv(tmp59, tmp56, tmp58.template cast<float,1>());
                            auto tmp61 = at::vec::VecMask<float,1>::from(tmp2);
                            auto tmp62 = decltype(tmp38)::blendv(tmp60, tmp38, tmp61.template cast<float,1>());
                            tmp62.store(out_ptr1 + static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int64_t>(x1);
                            auto tmp1 = static_cast<int64_t>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int64_t>(x3);
                                auto tmp5 = static_cast<int64_t>(257);
                                auto tmp6 = tmp4 < tmp5;
                                auto tmp7 = [&]
                                {
                                    auto tmp8 = -std::numeric_limits<float>::infinity();
                                    return tmp8;
                                }
                                ;
                                auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                                auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                                auto tmp11 = static_cast<int64_t>(3);
                                auto tmp12 = tmp10 < tmp11;
                                auto tmp13 = [&]
                                {
                                    auto tmp14 = c10::convert<int64_t>(x3);
                                    auto tmp15 = static_cast<int64_t>(256);
                                    auto tmp16 = tmp14 >= tmp15;
                                    auto tmp17 = [&]
                                    {
                                        auto tmp18 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                        return tmp18;
                                    }
                                    ;
                                    auto tmp19 = tmp16 ? tmp17() : static_cast<decltype(tmp17())>(0.0);
                                    auto tmp20 = static_cast<float>(0.0);
                                    auto tmp21 = tmp16 ? tmp19 : tmp20;
                                    return tmp21;
                                }
                                ;
                                auto tmp22 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                                auto tmp23 = static_cast<float>(0.0);
                                auto tmp24 = tmp12 ? tmp22 : tmp23;
                                auto tmp25 = tmp6 ? tmp9 : tmp24;
                                return tmp25;
                            }
                            ;
                            auto tmp26 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            auto tmp27 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                            auto tmp28 = static_cast<int64_t>(3);
                            auto tmp29 = tmp27 < tmp28;
                            auto tmp30 = [&]
                            {
                                auto tmp31 = c10::convert<int64_t>(x3);
                                auto tmp32 = static_cast<int64_t>(256);
                                auto tmp33 = tmp31 >= tmp32;
                                auto tmp34 = [&]
                                {
                                    auto tmp35 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                    return tmp35;
                                }
                                ;
                                auto tmp36 = tmp33 ? tmp34() : static_cast<decltype(tmp34())>(0.0);
                                auto tmp37 = static_cast<float>(0.0);
                                auto tmp38 = tmp33 ? tmp36 : tmp37;
                                return tmp38;
                            }
                            ;
                            auto tmp39 = tmp29 ? tmp30() : static_cast<decltype(tmp30())>(0.0);
                            auto tmp40 = static_cast<float>(0.0);
                            auto tmp41 = tmp29 ? tmp39 : tmp40;
                            auto tmp42 = tmp2 ? tmp26 : tmp41;
                            out_ptr1[static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0))] = tmp42;
                        }
                    }
                }
            }
        }
    }
}
''')
```
After this PR,
```
cpp_fused_copy_full_like_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_root/ub/cub6x5nmhqhp7xapkb3dlgjxef3t2bnkx7y7n4z2f4z5obnecxpy.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(128)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(12L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(512L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = c10::convert<int>(x1);
                            auto tmp1 = static_cast<int>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int>(x3);
                                auto tmp5 = at::vec::Vectorized<int>::arange(tmp4, 1);
                                auto tmp6 = static_cast<int>(257);
                                auto tmp7 = at::vec::Vectorized<int>(tmp6);
                                auto tmp8 = at::vec::VecMask<int,1>(tmp5 < tmp7);
                                auto tmp10 = at::vec::VecMask<float,1>::from(tmp2);
                                auto tmp11 = tmp8 & tmp10;
                                auto tmp9 = [&]
                                {
                                    auto tmp12 = -std::numeric_limits<float>::infinity();
                                    return tmp12;
                                }
                                ;
                                auto tmp13 =
                                [&]
                                {
                                    if (tmp11.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(at::vec::Vectorized<float>(tmp9()))::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), at::vec::Vectorized<float>(tmp9()), tmp11.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp14 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                                auto tmp15 = static_cast<int>(3);
                                auto tmp16 = tmp14 < tmp15;
                                auto tmp18 = tmp16 & tmp2;
                                auto tmp17 = [&]
                                {
                                    auto tmp19 = at::vec::Vectorized<int>(tmp1);
                                    auto tmp20 = at::vec::VecMask<int,1>(tmp5 >= tmp19);
                                    auto tmp22 = at::vec::VecMask<float,1>::from(tmp18);
                                    auto tmp23 = tmp20 & tmp22;
                                    auto tmp21 = [&]
                                    {
                                        auto tmp24 = tmp23.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                        return tmp24;
                                    }
                                    ;
                                    auto tmp25 =
                                    [&]
                                    {
                                        if (tmp23.all_zero())
                                        {
                                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                        }
                                        else
                                        {
                                            return decltype(tmp21())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp21(), tmp23.template cast<float,1>());
                                        }
                                    }
                                    ()
                                    ;
                                    auto tmp26 = static_cast<float>(0.0);
                                    auto tmp27 = at::vec::Vectorized<float>(tmp26);
                                    auto tmp28 = decltype(tmp25)::blendv(tmp27, tmp25, tmp20.template cast<float,1>());
                                    return tmp28;
                                }
                                ;
                                auto tmp29 = tmp16 ? tmp17() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                                auto tmp30 = static_cast<float>(0.0);
                                auto tmp31 = at::vec::VecMask<float,1>::from(tmp16);
                                auto tmp32 = at::vec::Vectorized<float>(tmp30);
                                auto tmp33 = decltype(tmp29)::blendv(tmp32, tmp29, tmp31.template cast<float,1>());
                                auto tmp34 = decltype(tmp13)::blendv(tmp33, tmp13, tmp8.template cast<float,1>());
                                return tmp34;
                            }
                            ;
                            auto tmp35 = tmp2 ? tmp3() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp36 = c10::convert<int>(c10::div_floor_integer(x1, 256L));
                            auto tmp37 = static_cast<int>(3);
                            auto tmp38 = tmp36 < tmp37;
                            auto tmp39 = [&]
                            {
                                auto tmp40 = c10::convert<int>(x3);
                                auto tmp41 = at::vec::Vectorized<int>::arange(tmp40, 1);
                                auto tmp42 = at::vec::Vectorized<int>(tmp1);
                                auto tmp43 = at::vec::VecMask<int,1>(tmp41 >= tmp42);
                                auto tmp45 = at::vec::VecMask<float,1>::from(tmp38);
                                auto tmp46 = tmp43 & tmp45;
                                auto tmp44 = [&]
                                {
                                    auto tmp47 = tmp46.template cast<float,1>().template loadu<float,1>(in_ptr0 + static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0)));
                                    return tmp47;
                                }
                                ;
                                auto tmp48 =
                                [&]
                                {
                                    if (tmp46.all_zero())
                                    {
                                        return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                    }
                                    else
                                    {
                                        return decltype(tmp44())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp44(), tmp46.template cast<float,1>());
                                    }
                                }
                                ()
                                ;
                                auto tmp49 = static_cast<float>(0.0);
                                auto tmp50 = at::vec::Vectorized<float>(tmp49);
                                auto tmp51 = decltype(tmp48)::blendv(tmp50, tmp48, tmp43.template cast<float,1>());
                                return tmp51;
                            }
                            ;
                            auto tmp52 = tmp38 ? tmp39() : at::vec::Vectorized<float>(static_cast<float>(0.0));
                            auto tmp53 = static_cast<float>(0.0);
                            auto tmp54 = at::vec::VecMask<float,1>::from(tmp38);
                            auto tmp55 = at::vec::Vectorized<float>(tmp53);
                            auto tmp56 = decltype(tmp52)::blendv(tmp55, tmp52, tmp54.template cast<float,1>());
                            auto tmp57 = at::vec::VecMask<float,1>::from(tmp2);
                            auto tmp58 = decltype(tmp35)::blendv(tmp56, tmp35, tmp57.template cast<float,1>());
                            tmp58.store(out_ptr1 + static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x3=static_cast<long>(512L); x3<static_cast<long>(513L); x3+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int64_t>(x1);
                            auto tmp1 = static_cast<int64_t>(256);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = c10::convert<int64_t>(x3);
                                auto tmp5 = static_cast<int64_t>(257);
                                auto tmp6 = tmp4 < tmp5;
                                auto tmp7 = [&]
                                {
                                    auto tmp8 = -std::numeric_limits<float>::infinity();
                                    return tmp8;
                                }
                                ;
                                auto tmp9 = tmp6 ? tmp7() : static_cast<decltype(tmp7())>(0.0);
                                auto tmp10 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                                auto tmp11 = static_cast<int64_t>(3);
                                auto tmp12 = tmp10 < tmp11;
                                auto tmp13 = [&]
                                {
                                    auto tmp14 = tmp4 >= tmp1;
                                    auto tmp15 = [&]
                                    {
                                        auto tmp16 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                        return tmp16;
                                    }
                                    ;
                                    auto tmp17 = tmp14 ? tmp15() : static_cast<decltype(tmp15())>(0.0);
                                    auto tmp18 = static_cast<float>(0.0);
                                    auto tmp19 = tmp14 ? tmp17 : tmp18;
                                    return tmp19;
                                }
                                ;
                                auto tmp20 = tmp12 ? tmp13() : static_cast<decltype(tmp13())>(0.0);
                                auto tmp21 = static_cast<float>(0.0);
                                auto tmp22 = tmp12 ? tmp20 : tmp21;
                                auto tmp23 = tmp6 ? tmp9 : tmp22;
                                return tmp23;
                            }
                            ;
                            auto tmp24 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            auto tmp25 = c10::convert<int64_t>(c10::div_floor_integer(x1, 256L));
                            auto tmp26 = static_cast<int64_t>(3);
                            auto tmp27 = tmp25 < tmp26;
                            auto tmp28 = [&]
                            {
                                auto tmp29 = c10::convert<int64_t>(x3);
                                auto tmp30 = tmp29 >= tmp1;
                                auto tmp31 = [&]
                                {
                                    auto tmp32 = in_ptr0[static_cast<long>((-256L) + x3 + (513L*(static_cast<long>(x1) % static_cast<long>(256L))) + (262656L*(c10::div_floor_integer(x1, 256L))) + (787968L*x2) + (9455616L*x0))];
                                    return tmp32;
                                }
                                ;
                                auto tmp33 = tmp30 ? tmp31() : static_cast<decltype(tmp31())>(0.0);
                                auto tmp34 = static_cast<float>(0.0);
                                auto tmp35 = tmp30 ? tmp33 : tmp34;
                                return tmp35;
                            }
                            ;
                            auto tmp36 = tmp27 ? tmp28() : static_cast<decltype(tmp28())>(0.0);
                            auto tmp37 = static_cast<float>(0.0);
                            auto tmp38 = tmp27 ? tmp36 : tmp37;
                            auto tmp39 = tmp2 ? tmp24 : tmp38;
                            out_ptr1[static_cast<long>(x3 + (513L*x1) + (525312L*x2) + (6303744L*x0))] = tmp39;
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124921
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #124597
2024-04-28 04:33:25 +00:00
57790fd088 [inductor] share cse cache during vectorized indirect load (#124597)
Fix https://github.com/pytorch/pytorch/issues/123502

`swap_buffer` in not needed in vectorized indirect load, remove it to share cse buffer.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other codes
//
// also store tmp7 here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-28 01:02:48 +00:00
7478b7f1ca Add common used score_mod functions for templated attention (#124670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124670
Approved by: https://github.com/Chillee
2024-04-27 21:04:52 +00:00
df08140de2 [dynamo] Collect cell_and_freevars correctly (#125097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125097
Approved by: https://github.com/Skylion007
2024-04-27 20:39:54 +00:00
7aa6bd7fa0 Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735)
This ensures that first argument to record_shapeenv_event is a ShapeEnv
so we can appropriately short circuit when recording is not in progress.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123735
Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD
2024-04-27 20:36:40 +00:00
9ce58542ba Ignore torch/distributed/_tensor/_collective_utils.py for TOR901 (#125082)
Fixes https://github.com/pytorch/pytorch/issues/125050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125082
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-04-27 20:14:02 +00:00
b4a008209a Expose tensor check from guard for reusing (#124836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124836
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-04-27 18:35:35 +00:00
f0a5a0d298 OSS: Capture triton kernel in ET (#124775)
This DIFF is to capture triton kernels in execution trace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124775
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
2024-04-27 18:01:18 +00:00
8246f42864 Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)
Fixes #65307

For consistency with Python Array API (https://data-apis.org/array-api/latest/API_specification/constants.html) and NumPy  (https://numpy.org/devdocs/reference/constants.html), I added `torch.newaxis = None`.

Note that the consistency is directly mentioned also in the `__init__.py`, right above the added export.

The `torch.newaxis` is also mentioned in #110636.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125026
Approved by: https://github.com/lezcano
2024-04-27 16:40:51 +00:00
9bf53b128c [codemod] Remove unused variables in caffe2/aten/src/ATen/test/scalar_test.cpp (#125041)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D56587751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125041
Approved by: https://github.com/Skylion007
2024-04-27 15:53:16 +00:00
905318818d [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/core.cpp +2 (#125047)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D56614179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125047
Approved by: https://github.com/Skylion007
2024-04-27 15:52:56 +00:00
61e937f3d6 Add registration API for torch.compile-eager (#121387)
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.

In this PR, we intend to provide a registration API dedicated to eager-through-torch.compile. The major workflow of this API will be as follows.

- Load cache
- Check cache according to the input tensors
  - Cache Hit: Run the cached kernel directly
  - Cache Miss: Run the AOTI to produce kernel and run the produced kernel. If AOTI fails to produce the kernel, invoke the python fallback function.

Currently, this PR always fallback to python kernel now and cache mechanism will be implemented in another PR - https://github.com/pytorch/pytorch/pull/116368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121387
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/zou3519, https://github.com/jgong5
2024-04-27 12:49:58 +00:00
620d808da0 [Pytorch 2] Forward fix for broken test (#125065)
Summary:
This is a forward Hotfix for T186742340.

Some recent changes in Pytorch / Inductor ( D56458606) led to aten.addmm operators being inserted twice into the list of choices to select from during autotuning. This appears to have triggered a test failure in fbcode.

This fix prevents the aten operators being added twice to the list of choices for autotuning.

Test Plan:
* Pytorch CI
 * CUDA_LAUNCH_BLOCKING=1 buck2 test 'fbcode//mode/opt' fbcode//accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test -- --exact 'accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test - test_compile_pt2 (accelerators.pytorch.lib.pt2_utils.tests.compile_pt2_test.TestCompilePT2)'

Differential Revision: D56642879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125065
Approved by: https://github.com/eellison
2024-04-27 10:27:44 +00:00
d4a1b3e093 Make c10d_functional ops call into _c10d_functional ops (#124979)
This PR removes the legacy impls of c10d_functional ops which are now irrelevant. For backward compatibility purpose, c10d_functional ops now call into _c10d_functional ops.

We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
2024-04-27 08:08:02 +00:00
91a4740e72 Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052
Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy
2024-04-27 07:59:39 +00:00
cyy
b3fd94d15e [Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
2024-04-27 07:22:27 +00:00
ce503c1b40 Dynamo x autograd.Function supports setup_context (#124802)
Fixes part of #118397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124802
Approved by: https://github.com/zou3519
2024-04-27 04:57:13 +00:00
eqy
a866bfff45 [cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)
#113713
currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs

Will also collect benchmark data,

CC @drisspg

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-27 04:15:49 +00:00
5944a53555 [MPS] Fix nextafter for negative values (#125029)
By changing the logic to on older MacOS:
```cpp
bits += ((input > 0) ^ (input > other)) ? 1 : -1;
```
And use native `nextafter` on MacOS Sonoma (i.e. if Metal 3.1 is available)

TODO:
  - Add tests for infs and denorms

Fixes https://github.com/pytorch/pytorch/issues/124985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029
Approved by: https://github.com/Skylion007
2024-04-27 02:58:05 +00:00
35b332882b [Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387)
As the title
**Test plan**
python test/test_quantization.py -k test_linear_binary

Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122387
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #123240
2024-04-27 02:40:57 +00:00
dc4c75ba72 elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982)
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.

This uses 2 approaches for different areas:

* local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in a O(n) operation which reduces total store operations to be linear with number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add.

At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds.

Test Plan:
This is testing using many small tests running on a remote cluster.

{D56549942}

```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```

Differential Revision: D56605193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-27 02:21:44 +00:00
1a6fef15ef [compiled autograd] verbose logs for debugging cache misses (#124980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124980
Approved by: https://github.com/jansel
ghstack dependencies: #124954
2024-04-27 01:10:37 +00:00
43a7ab2a21 [compiled autograd] introduce verbose logs, add autograd node info to graph (#124954)
- sets it as a fake stack trace as we don't have a generic comment feature
- when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
2024-04-27 01:10:37 +00:00
e592a609fd [Quant][ONEDNN] improve performance of qconv by reducing integration overhead (#123240)
## Description
Framework overhead is found to be big for the onednn qconv op (used for quantization with PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv.

## performance results
Running quantized Resnet50 on an Intel(R) Xeon(R) Platinum 8490H machine
Before
```
Average latency: 8.378 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        86.54%       6.954ms        87.42%       7.025ms     132.547us            53
```
After
```
Average latency: 6.255 ms.
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
onednn::qconv2d_pointwise        85.05%       6.381ms        85.98%       6.451ms     121.717us            53
```
Test script:
```python
import torch
import torchvision
import time
import copy
import numpy as np
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
    prepare_pt2e,
    convert_pt2e,
)
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

torch._inductor.config.cpp.enable_kernel_profile=True
torch._inductor.config.profiler_mark_wrapper_call = True
torch._inductor.config.freezing = True
torch._inductor.config.cpp_wrapper = True

def bench_model(model, inputs):
    times =[]
    with torch.no_grad():
        for _ in range(5): # warm-up
            output = model(inputs)
        for _ in range(20):
            start_time = time.time()
            output = model(inputs)
            end_time = time.time()
            times.append(end_time - start_time)
        print ('Average latency: %0.3f ms.' % (np.median(times) * 1000.0))

        with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p:
            out_ipex = model(inputs)
        print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))

def pt2e_ptq(m, example_inputs):

    m = m.eval()

    exported_model = capture_pre_autograd_graph(m, example_inputs)
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)

    _ = prepared_model(*example_inputs)

    converted_model = convert_pt2e(prepared_model)
    torch.ao.quantization.move_exported_model_to_eval(converted_model)
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
        _ = optimized_model(*example_inputs)

    bench_model(optimized_model, *example_inputs)

    return optimized_model

if __name__ == "__main__":

    data = torch.randn(16, 3, 224, 224)
    model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
    pt2e_ptq(copy.deepcopy(model_fp), (data,))
```

Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-04-27 00:52:45 +00:00
368f5212fa [cpu] [inductor] decompose bmm for memory bound in lowering (#124826)
Fixes #124697. Resolve the issue of large regression of GPT-FAST MOE with `coordinate_descent_tuning` disabled.

To get better perf for memory bound case, we decompose bmm in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124826
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-27 00:19:10 +00:00
ebb8905e0c [cpu] add VecConvert between 8bits and 16bits (#124828)
The perf benefit was found in https://github.com/pytorch/pytorch/issues/124697#issuecomment-2071658300.

The PR adds intrinsic specializations between int8/uint8 and bf16/fp16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124828
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-27 00:17:44 +00:00
fd24d8c05a [dynamo][nn module] Use correct sources for _call_impl (#124970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124970
Approved by: https://github.com/jansel
ghstack dependencies: #124779, #124627
2024-04-26 23:18:30 +00:00
43069c460e Correct check for Boolean list input type (#124899)
Summary:
This diff fixes a bug in PyTorch where when creating a tensor from a List of booleans, PyTorch was throwing an error.

This fix resolves that issue. All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix.

Test Plan: Running our model end to end works as expected and no error occurs.

Differential Revision: D55990810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899
Approved by: https://github.com/zhxchen17
2024-04-26 22:25:43 +00:00
be2c09725a [dtensor][experimental] local_map (#123676)
**Summary**
This PR is attempt to land an experimental feature designed in #103686 . `local_map` is designed to allow users to apply to `DTensor` objects a function that was written to apply to `torch.Tensor`.

As a function, `local_map` takes in 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of input `DTensor`. `out_placements` is the sharding specification of output `DTensor`.

`local_map` returns a new function that does the following:

1. Infer `device_mesh` and `in_placements` from `DTensor` input if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of corresponding `DTensor` input before feeding its local shard into `func`. In case it is different from `DTensor`'s sharding specification, if `redistribute_inputs=False` an exception will be raised, otherwise perform a resharding to the required sharding.
2. Call `func` with the arguments passed in along with `device_mesh` except `DTensor`s. For `DTensor`, pass in its local shard. This `func` may include collectives.
3. For each output of `func` that has validate (i.e. not `None) sharding specification in `out_placements`, construct a new `DTensor` using the output and the specification. Use this `DTensor` as the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol
2024-04-26 22:23:59 +00:00
83e7b9d25f [Inductor] Support fusion of chained reductions even if keepdims=True (#124843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124843
Approved by: https://github.com/shunting314
2024-04-26 21:50:52 +00:00
a68a8c0f6b Disable test_binary_op_list_error_cases in test_foreach (#125046)
It's really flaky

ex
* https://github.com/pytorch/pytorch/issues/124636
* https://github.com/pytorch/pytorch/issues/124529

there are more
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125046
Approved by: https://github.com/huydhn
2024-04-26 21:25:38 +00:00
c6b7504d47 Fix torch.library.register_fake's module reporting (#125037)
torch.library.register_fake reports the python module the fake impl is
located in. This is used to check against
`m.set_python_module("foo.bar")` calls in C++.

The module reporting logic was wrong in most cases. This PR fixes it.

Test Plan:
- exhaustive tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125037
Approved by: https://github.com/williamwen42
2024-04-26 20:53:33 +00:00
cd06c73cbd [Inductor Cutlass backend] Improved GEMM template (#124577)
Improves the Cutlass backend GEMM template:

 * Adds code which allows to create stand-alone test runners for Cutlass GEMM Kernels, which allows (manual) debugging of, for example, CUDA IMA errors or similar problems which occur in practice. Includes some utility code and tests to actually compile and run these standalone tests.
 * Cleans up the GEMM template code through various refactorings
 * Eliminates code sections and options that are unneccessary now that epilogue fusions are being removed.
 * Limits the scope of a workaround for (flaky) Cutlass issues with bias broadcasting to neccessary cases.
 * Puts some CPU runtime checks into #if / #endif blocks, such that it's possible to compile CUTLASS Kernels with lower CPU overhead.
 * Add documentation comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124577
Approved by: https://github.com/jansel
ghstack dependencies: #124576
2024-04-26 20:03:20 +00:00
4a6dfbe480 Add label to label config to auto apply labels based on other labels (#125042)
* Implemented in https://github.com/pytorch/test-infra/pull/5127,
* Tested in malfet/delete me: https://github.com/malfet/deleteme/issues/85
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125042
Approved by: https://github.com/huydhn
2024-04-26 19:58:56 +00:00
4e2b4c6ed6 Fix broken docs (#124940)
These were causing doctest to be unhappy.

In particular the doc from #124496 caused #124771 to fail "trunk / win-vs2019-cpu-py3 / test" to fail when pushing. Not sure why it wasn't a problem on the original PR.

Testing:

`./test/run_doctests.sh`:
  before:
```
=== 4 warnings in 11.21 seconds ===
```
  after:
```
===  in 11.11 seconds ===
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124940
Approved by: https://github.com/zou3519, https://github.com/atalman, https://github.com/huydhn
2024-04-26 19:24:52 +00:00
9266e472e2 rename ort to maia in dynamo's ort backend. (#124967)
Fixes #124966

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124967
Approved by: https://github.com/thiagocrepaldi
2024-04-26 19:09:29 +00:00
abcb42cdd2 Avoid COW materialize in various places (1) (#124984)
Most, not all, of these cases were found automatically with `git grep -n '^\s*\<const\>.*\*.*=.*\<data_ptr\>'`

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124984
Approved by: https://github.com/Skylion007
2024-04-26 19:06:28 +00:00
2ea1e84d40 log pt2 config dict to signpost from inductor post grad (#124593)
Summary:
previous attempts don't work eventually.  D49720297 causes online train SEV due to extra importing.  D56299408 mitigates a tricky bug from Distributed Shampoo constructor but unfortutenaly didn't correct the scuba logging either.

see f552546983

Test Plan: {F1491621504}

Differential Revision: D56378270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124593
Approved by: https://github.com/anijain2305
2024-04-26 18:57:11 +00:00
91d565da0c [dynamo] Add support for tensor's is_complex method (#124927)
This PR is to add support for tensor's is_complex method in dynamo. Take the following code as an example:
```python
   def test_tensor_is_complex(x):
        if x.is_complex():
            return x + 1
        else:
            return x - 1
```
Before this fix, the is_complex() call will cause a graph break "torch.* op returned non-Tensor bool call_method is_complex". After this fix, the graph break can be avoided.

Fixes #122692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124927
Approved by: https://github.com/ezyang
2024-04-26 18:28:14 +00:00
781ea00c90 [TD] Query Github API for base (#122214)
A better query for the base commit of a PR.
Some ghstack PRs are not connected to main so git merge-base doesn't work.  Instead, use the Github API to query for the base of the PR, which should be more accurate

Sanity checked on one of Ed's ghstack PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214
Approved by: https://github.com/seemethere
2024-04-26 18:21:24 +00:00
858fdd8c40 Remove cppwrapper option on inductor benchmark workflow (#124971)
I'm restoring the `training` and `inference` options after github.com/pytorch/pytorch/pull/124795 and remove the not less-known `cppwrapper` option instead per @desertfire suggestion.  The total number of parameters remains at 10.

Also, the default choice for training and inference are explicitly spelled out when dispatching the workflow manually to catch dev attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124971
Approved by: https://github.com/ezyang
2024-04-26 17:41:24 +00:00
392dc45597 Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
ghstack dependencies: #124444
2024-04-26 17:22:13 +00:00
b4d39a5de9 Revert "[TD] Query Github API for base (#122214)"
This reverts commit b003e0f29eeb4a810c47056400918924948b88c2.

Reverted https://github.com/pytorch/pytorch/pull/122214 on behalf of https://github.com/clee2000 due to failing on main due to mistake ([comment](https://github.com/pytorch/pytorch/pull/122214#issuecomment-2079732105))
2024-04-26 16:42:51 +00:00
8461e7ed9e Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks. Since we added a fake device backend, it is mutual exclusive to other backends. Tests will be skipped if TEST_CUDA or TEST_ROCM is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-26 16:17:54 +00:00
73744a2c00 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
36af9c0d7d [Aten] Fix XPU convolution_overrideable input memory format. (#124841)
[Aten] Fix convolution_overrideable input memory format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124841
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-04-26 15:55:01 +00:00
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
609c958281 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-26 15:35:53 +00:00
8d12ba9acf add methods for open device in PackedSequence module. (#124923)
1) add is_{custom_device_name}() and {custom_device_name}() for open device register;
2) fix open device failed testcases.

@ezyang  @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124923
Approved by: https://github.com/ezyang
2024-04-26 15:26:20 +00:00
b003e0f29e [TD] Query Github API for base (#122214)
A better query for the base commit of a PR.
Some ghstack PRs are not connected to main so git merge-base doesn't work.  Instead, use the Github API to query for the base of the PR, which should be more accurate

Sanity checked on one of Ed's ghstack PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122214
Approved by: https://github.com/seemethere
2024-04-26 15:16:36 +00:00
6b54f9d3e1 Revert "fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037)"
This reverts commit f9379ebbbf1369aad8179cac4a2eb7d72f25739e.

Reverted https://github.com/pytorch/pytorch/pull/124037 on behalf of https://github.com/jeanschmidt due to introducing regressions in benchmark, see D56623194 for more details ([comment](https://github.com/pytorch/pytorch/pull/124037#issuecomment-2079574308))
2024-04-26 15:07:09 +00:00
6bef5e9f67 [CI] Add retry mechanism to check if the Docker daemon is running (#124728)
What is done:
* Skipped the 'Kill existing containers' step - ARC runners are always ephemeral.
* Added a retry mechanism to check if the Docker daemon is running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124728
Approved by: https://github.com/seemethere, https://github.com/zxiiro, https://github.com/ZainRizvi
2024-04-26 14:36:32 +00:00
2f3b0befed [BE]: Apply ruff FURB 118. (#124743)
Replaces various lambdas with operator.itemgetter which is more efficient (as it's a builtin function). Particularly useful for when lambdas are used as 'key' functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124743
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-26 14:34:52 +00:00
fc2aa23c1e Test reland "AOTAutograd: gate view-replay behind config, not the def… (#124948)
A parallel attempt at landing https://github.com/pytorch/pytorch/pull/124945, but attempting to land through fbcode first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124948
Approved by: https://github.com/albanD
2024-04-26 13:16:26 +00:00
fc13c1c850 [aot_inductor] Enable test_aot_inductor tests for ROCm (#123393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123393
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-04-26 13:15:35 +00:00
3d8585e501 [XPU] Add manual_seed and synchronize method (#124709)
This PR set the following device-specific settings for xpu(Intel GPU) specific:
1. Set the manual seed for xpu
2. Set the synchronization method for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124709
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-04-26 12:32:12 +00:00
74afccdd80 [parametrization] fix requires_grad propagation (#124888)
Summary:
Previously the `requires_grad` is not propagated from original Tensor to decomposed tensors

Test Plan:
python test/test_parametrization.py -k test_register_parametrization_no_grad

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124888
Approved by: https://github.com/lezcano
2024-04-26 10:19:31 +00:00
d1b25596d5 Revert "Add common used score_mod functions for templated attention (#124670)"
This reverts commit ed120b08c4828c39f116cfe1fb39195c844be485.

Reverted https://github.com/pytorch/pytorch/pull/124670 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, more info can be found in D56571389 ([comment](https://github.com/pytorch/pytorch/pull/124670#issuecomment-2079084881))
2024-04-26 10:18:18 +00:00
bba59b718b Teach ShapeEnv that a <= b => a < b + 1 (#123436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123436
Approved by: https://github.com/ezyang
ghstack dependencies: #123342
2024-04-26 10:18:01 +00:00
fa5ea29863 Apply guard knowledge to all simplifications (#123342)
This was an oversight in a previous PR. We were just applying this
knowledge when the expression had an unbacked int
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123342
Approved by: https://github.com/ezyang
2024-04-26 10:18:00 +00:00
359ff49bf4 Revert "[dtensor] move pad/unpad_tensor to separate utils (#124871)"
This reverts commit 0b0eea222978e6b377e2c67f89902d5eb1aa7da3.

Reverted https://github.com/pytorch/pytorch/pull/124871 on behalf of https://github.com/jeanschmidt due to Broke internal tests, see D56587991 for more details ([comment](https://github.com/pytorch/pytorch/pull/124871#issuecomment-2079001103))
2024-04-26 09:30:34 +00:00
35a82d4a4a Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#124654)"
This reverts commit 872eeb0d7deebb58915289756d8c786f68630547.

Reverted https://github.com/pytorch/pytorch/pull/124654 on behalf of https://github.com/jeanschmidt due to Broken lots of internal signals, check D56571345 for more details ([comment](https://github.com/pytorch/pytorch/pull/124654#issuecomment-2078940680))
2024-04-26 08:56:03 +00:00
7324ddd80c Revert "Delete erroneous print (#124972)"
This reverts commit 333f095d0779ecf0ce489ceecff35404abde8581.

Reverted https://github.com/pytorch/pytorch/pull/124972 on behalf of https://github.com/jeanschmidt due to Need to revert #124654 but this PR depends on it :( ([comment](https://github.com/pytorch/pytorch/pull/124972#issuecomment-2078936303))
2024-04-26 08:52:27 +00:00
19a83eacb5 add new API torch.amp.is_autocast_available (#124938)
# Motivation
expose `torch._is_autocast_available` to `torch.amp.is_autocast_available` as a public api.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938
Approved by: https://github.com/albanD
2024-04-26 08:45:20 +00:00
a46c27d961 Revert "Verify types in custom op schemas (#124520)"
This reverts commit 141888765bba129914448a9609ad5e182778cbdc.

Reverted https://github.com/pytorch/pytorch/pull/124520 on behalf of https://github.com/jeanschmidt due to Breaking internal tests check D56588015 for more details ([comment](https://github.com/pytorch/pytorch/pull/124520#issuecomment-2078917978))
2024-04-26 08:42:11 +00:00
9c7c81b897 [BE] Test everything against scipy-1.10.0 (#124983)
Which is the oldest one that does not have a memory leak regression tracked in CVE-2023-25399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124983
Approved by: https://github.com/kit1980
2024-04-26 07:03:34 +00:00
63d4dc5a80 Remove TMP_LIBKINETO_NANOSECOND flag from Compilation (#124734)
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.

Test Plan: Regression should cover this change

Reviewed By: aaronenyeshi

Differential Revision: D56444570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
2024-04-26 06:57:03 +00:00
4ad291d07f [DeviceMesh] Removing mapping child_to_parent_mapping from _MeshEnv (#124890)
Summary: The mapping is no longer needed after https://github.com/pytorch/pytorch/pull/124780, as we are not going to re-create the pgs during mesh slicing.

Test Plan: CI

Differential Revision: D56499001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124890
Approved by: https://github.com/awgu
2024-04-26 06:40:36 +00:00
f131c2c199 Revert "Fix mypy issues in fake_tensor.py (#124428)"
This reverts commit 25c0d3f3f0b19b7ca88bc92e9dc56e391d18e010.

Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934bfa1ff358b1c01d3754f2807cd04ee2.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
e607dc8abb Revert "Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735)"
This reverts commit 87bec7db4e55f329e077eb7003af2f4817cd4210.

Reverted https://github.com/pytorch/pytorch/pull/123735 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, more info in D56587358 ([comment](https://github.com/pytorch/pytorch/pull/123735#issuecomment-2078695590))
2024-04-26 06:10:58 +00:00
e323c681ad Update trymerge to honor the list of unstable failures from Dr.CI (#124965)
After https://github.com/pytorch/test-infra/pull/5131, we want to have trymerge to honor the list of unstable failures from Dr.CI because having the unstable keyword is the job name now doesn't cover all unstable jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124965
Approved by: https://github.com/clee2000
2024-04-26 05:10:50 +00:00
b3cf36cb7c Implement deepcopy / clone for SymNode, NestedIntSymNode (#121361)
**Motivation**: There's a Meta-internal use case that deepcopies a bunch of metadata, which includes shapes. When we try to use NestedTensor with this tool, it errors out when we try to deepcopy the metadata, because SymNodes cannot be deepcopied. The change here is to add an implementation of `__deepcopy__`.

**Implementation**:
1. `__deepcopy__` on SymNode calls clone()
2. Implement `clone()` in NestedIntSymNode, which previously didn't have this implemented

**Potential Issues**:
Right now, this works.

But, regarding (2): Eventually we'll have some mapping between the NestedSymIntNode and its corresponding offsets/lengths tensor (cc @soulitzer who is working on this). How should this work with `__deepcopy__`? Should the offsets/lengths tensor also be cloned, or should the new symint reference the same offsets as the old symint?

On one hand, we already have this issue with NestedIntSymNodeImpl::mul(): mul() creates a new NestedIntSymNodeImpl. On the other hand, `__deepcopy__` might imply different semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121361
Approved by: https://github.com/soulitzer
2024-04-26 04:18:29 +00:00
14430564ce [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-26 03:22:29 +00:00
855939904b [cudagraphs] add more info to skip messages (#124700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124700
Approved by: https://github.com/eellison
ghstack dependencies: #119729
2024-04-26 03:22:29 +00:00
62b5738a8b [benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729)
aten.div's output device will be its numerator's device. so it is acceptable to do cuda / cpu type divisions. post grad passes operate only on graphs and can't handle runtime graph inputs. so we change user code to move inputs to cuda for cudagraph. this affects any graph that has cpu tensors as graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119729
Approved by: https://github.com/eellison
2024-04-26 03:22:26 +00:00
769b1e6cdc [profiler] Split up profiler test file (#124856)
To help with issues on test time out split profiler test file into 4 files.
- profiler
- record_function
- execution_trace
- torch_tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124856
Approved by: https://github.com/shengfukevin, https://github.com/aaronenyeshi
2024-04-26 03:19:25 +00:00
f9a611a3ce Update Jinja to 3.1.3 (#124976)
To fix CVE-2024-22195

Also, delete unused docs/cpp/requirements.txt and functorch/docs/requirements.txt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124976
Approved by: https://github.com/kit1980
2024-04-26 02:57:55 +00:00
43f4e71daa Making _MeshEnv subclassing thread local (#124555)
With _mesh_resources being global var, when thread pg based testing is used (aka spawn_threads_and_init_comms()), the last rank with the same key would overwrite the formers. This isn't an issue in regular process-based runtime as logically each key is unique.

Example failure: https://github.com/pytorch/pytorch/actions/runs/8779134353/job/24087295785
```
RuntimeError: Could not resolve the process group registered under the name 8
or
Throwing assert not none error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124555
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol
2024-04-26 02:45:42 +00:00
e913f77c60 Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)"
This reverts commit 9bccafc31c9d489b727155e95633efd19adbceaa.

Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/clee2000 due to broke tests but only on crossref https://github.com/pytorch/pytorch/actions/runs/8841521519/job/24279075171, added no td label so itll actually run this time ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2078530797))
2024-04-26 02:35:14 +00:00
b2f521f376 Revert "remove empty partition (#124920)"
This reverts commit 98835fff9fd498472b0e8f49a3a4670d86f3c5b7.

Reverted https://github.com/pytorch/pytorch/pull/124920 on behalf of https://github.com/clee2000 due to I think Dr CI is wrong, the xla failure looks real 98835fff9f https://github.com/pytorch/pytorch/actions/runs/8840540357/job/24278180954 ([comment](https://github.com/pytorch/pytorch/pull/124920#issuecomment-2078495051))
2024-04-26 02:03:01 +00:00
9bccafc31c Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
ghstack dependencies: #124444
2024-04-26 01:02:28 +00:00
7321005dd8 Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-26 01:02:28 +00:00
3a810bcf91 skip unsupported rocm test (#124968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124968
Approved by: https://github.com/jithunnair-amd, https://github.com/davidberard98
2024-04-26 00:36:30 +00:00
f9379ebbbf fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037)
fixes #123039

In abi mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_` which requires `AtenTensorHandle`, but the allocation generates ArrayRefTensor to allocate mem in stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack-mem-allocation in abi, and creates AtenTensorHandle instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037
Approved by: https://github.com/desertfire
2024-04-26 00:16:16 +00:00
333f095d07 Delete erroneous print (#124972)
I forgot to remove it before landing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124972
Approved by: https://github.com/albanD
2024-04-26 00:07:54 +00:00
c4b6ed4609 guard_size_oblivious in unbind (#124959)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124959
Approved by: https://github.com/albanD
2024-04-25 23:45:14 +00:00
c715e76799 [inductor] optimize isa dry compile time. (#124602)
Fixes #100378
Original issue caused by startup dry compile need cost almost 1 second.

This PR add compiler version info, isa build options and pytorch version info to the test binary path hash.
So same compile, same isa and same pytorch can skip the dry compile.

Local test:
First time:
<img width="1588" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/d0b83f5d-849e-4f37-9977-3b0276e5a5a5">
We need to compile all c++ modules and it cost 16.5s.

Second time:
<img width="1589" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/44f07fb0-5a15-4342-b0f6-dfe2c880b5d3">
We skipped dry compile due to the same isa fingerprint. It is only cost 0.36s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124602
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-04-25 23:27:57 +00:00
db3a2d751c [MPS][BE] Error-check linear (#124952)
Validate that all arguments are on MPS devices and dtypes are expected

Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```

Fixes https://github.com/pytorch/pytorch/issues/123995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
2024-04-25 23:25:20 +00:00
eqy
973d724e21 [CUDA] Fix 64-bit indexing in vol2col in conv3d (#124650)
Similar to #118005, fixes sometimes silent IMAs that occur

CC @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650
Approved by: https://github.com/soulitzer
2024-04-25 23:21:43 +00:00
33fae4fcf4 Revert "Use recursive blob for package data (#119257)"
This reverts commit f20e3ae0c36146c962a5665018e9ad662a7cf211.

Reverted https://github.com/pytorch/pytorch/pull/119257 on behalf of https://github.com/malfet due to This likely caused https://github.com/pytorch/pytorch/issues/124941, not sure why warning about recursive grep was ignored ([comment](https://github.com/pytorch/pytorch/pull/119257#issuecomment-2078312309))
2024-04-25 23:08:22 +00:00
98835fff9f remove empty partition (#124920)
In some rare scenarios, the partitioner will produce an empty partition. it's a waste of time to compile an empty graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124920
Approved by: https://github.com/ezyang
2024-04-25 23:07:41 +00:00
724f8dd8c5 [export] Serialize empty list based on argument type (#123748)
Fixes https://github.com/pytorch/pytorch/issues/123480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123748
Approved by: https://github.com/zhxchen17
2024-04-25 23:03:27 +00:00
7bb89bcaa4 [export] Fix state dict reparametrization in non-strict. (#124847)
Summary:

There are multiple things implemented incorrectly in non strict for reparametrizing state dict:
1. The same fake tensor should be generated for duplicated weights.
2. We should snapshot state dict in the beginning to always hold the invariant that ep.state_dict == mod.state_dict()
3. We will overwrite real weights with fake weights if we don't restore the weights in LIFO ordering.
4. We don't turn on strict checking which could sliently fail on corner cases.

This diff aims to solve all these issues at once.

Test Plan: CI

Differential Revision: D56505020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124847
Approved by: https://github.com/pianpwk
2024-04-25 22:44:16 +00:00
4259e5d0e0 [inductor] Specialize on unguarded alignment of example inputs (#123319)
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.

In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.

Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison
2024-04-25 22:28:15 +00:00
8db42e7688 [EZ][GHF] Rephrase cancelled message (#124947)
To encourage people to reissue the command if merge timed out

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124947
Approved by: https://github.com/kit1980, https://github.com/clee2000
2024-04-25 22:24:08 +00:00
00c5859aeb [dynamo] Add support for DELETE_SUBSCR (#123526)
Fixes #123317

Co-authored-by: Jason Ansel <jansel@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123526
Approved by: https://github.com/jansel
2024-04-25 22:07:24 +00:00
8c515a14fd [caffe2] Add build configuration for linux-arm64 (#124618)
Summary: This diff adds a new build configuration that works on linux-arm64.

Test Plan:
Before:
```
$ buck2 build @//arvr/mode/linux/jetson/opt :c10_ovrsource
BUILD FAILED
fbsource//xplat/caffe2/c10:c10_ovrsource is incompatible with cfg:linux-arm64-fbcode-platform010-aarch64-no-san#d47c4385e5d19fe0 (ovr_config//os:android unsatisfied), check the target's compatibility attributes
```

After:
```
$ buck2 build @//arvr/mode/linux/jetson/opt :c10_ovrsource
BUILD SUCCEEDED
```

Differential Revision: D56088211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124618
Approved by: https://github.com/izaitsevfb
2024-04-25 21:55:26 +00:00
84fb96130f [export] Fix check for optional tensor returns (#123739)
Sorry for the delay! Addressing issue in https://www.internalfb.com/diff/D55455000?dst_version_fbid=1599488570890576&transaction_fbid=776042617791884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123739
Approved by: https://github.com/zhxchen17
2024-04-25 20:51:26 +00:00
4b586a434f [ROCm] Triton upstream AMD backend integration (#121801)
Update ROCm-triton to use the AMD backend from https://github.com/openai/triton

Note: `test__int_mm` can be enabled after https://github.com/pytorch/pytorch/pull/122431 is landed

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121801
Approved by: https://github.com/nmacchioni, https://github.com/malfet
2024-04-25 20:44:27 +00:00
b8b04b26fb Forward fix for D56289438 (#124882)
Summary:
D56289438 from OSS breaks test
deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3 (deeplearning.aot_inductor.cpu.test.test_lowering_utils.CPULoweringTest)

The issue is that we use partial for aten.cat that shouldn't be directly failed out with assertion

Test Plan:
```
deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test - test_cpu_lower_merge_with_ibb_3
```

Differential Revision: D56541352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124882
Approved by: https://github.com/chenyang78
2024-04-25 18:42:09 +00:00
d5182bb75b Enable UFMT on test/test_cuda*.py (#124352)
Part of: #123062

Ran lintrunner on:

- test/test_cuda.py
- test/test_cuda_expandable_segments.py
- test/test_cuda_multigpu.py
- test/test_cuda_nvml_based_avail.py
- test/test_cuda_primary_ctx.py
- test/test_cuda_sanitizer.py
- test/test_cuda_trace.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124352
Approved by: https://github.com/ezyang
2024-04-25 18:31:08 +00:00
977dc5593a [EZ] Get rid of utf-8 quotes (#124932)
Replace `“important”` with `"important"` and `Taylor’s` with `Taylor's`

Fixes the obvious symptoms of https://github.com/pytorch/pytorch/issues/124897

Test plan: Download [wheel](https://github.com/pytorch/pytorch/actions/runs/8833051644/artifacts/1447995459) and check that generated VF.pyi does not have any unicode characters by running following command:
```
% python3 -c "x=open('_VF.pyi', encoding='utf-8').read();uc=[(i, x[i]) for i in range(len(x)) if ord(x[i])>127];print(uc);assert(len(uc)==0)"
```
2024-04-25 11:22:20 -07:00
751d9a319d [AOTI] Add a unit test (#124486)
Summary: from https://github.com/pytorch/pytorch/issues/123745, the test seems already fixed in the nightly, but still worth to add it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124486
Approved by: https://github.com/chenyang78
2024-04-25 18:05:10 +00:00
cyy
a8aed4ce3f Fix MPI_Group initialization errors (#124824)
Fixes MPI_Group initialization errors introduced in #124156, since MPI_Group is not a pointer in some MPI implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124824
Approved by: https://github.com/ezyang
2024-04-25 17:27:30 +00:00
29b22fbef9 Typo fix: s/nonzero/unique/ (#124935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124935
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-04-25 17:22:50 +00:00
93a319a4fc [export] kill _process_constraints() (#123985)
The process for populating range_constraints follows separate methods for non-strict (`make_constraints`), and strict (`_process_constraints`). The strict method is somewhat more convoluted, and the analysis that Dynamo performs for strict is already present as part of the non-strict process in make_constraints (produce_guards(), running the export constraint solver).

This PR kills _process_constraints() and replaces calls with make_constraints, without duplicating the work that Dynamo already does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123985
Approved by: https://github.com/avikchaudhuri
2024-04-25 16:58:57 +00:00
9aeeb8e925 [Inductor Cutlass backend] Improve GEMM op filtering (#124576)
Add configurable allowlist / denylist regular expressions to make it possible to exclude certain
CUTLASS GEMM implementations ( for example "pingpong" Kernels due to undesired numerical behavior ).

Remove usage of old 2.x Cutlass Kernels entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124576
Approved by: https://github.com/jansel, https://github.com/eellison
2024-04-25 16:33:54 +00:00
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63df4b1b31cc95531320caf83b1b60b3.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
4a1299cc0e Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit 355dc34f865036c4c625fcdafe54db846b2be2c2.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
3de78a1b48 [dynamo][cpp-guards] EQUALS MATCH - Cache first passing value (#124627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124627
Approved by: https://github.com/jansel
ghstack dependencies: #124779
2024-04-25 15:24:12 +00:00
87079f5e91 [DCP] Fix broken validate checkpoint api test (#124786)
This test appears broken, but is somehow not failing CI/CD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124786
Approved by: https://github.com/fegin, https://github.com/wz337
2024-04-25 14:50:58 +00:00
cdc66e9dc3 refactor autocast python APIs (#124479)
# Motivation
Refactor autocast usage scenario in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix the bug - convention conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`.

# Solution
Use device-agnostic APIs like `torch.get_autocast_dtype`, ..., instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124479
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #124359
2024-04-25 14:33:33 +00:00
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
44bb5da529 Fix mkl cmake not support static mkl on Windows. (#124925)
Fixes #124869

Fix mkl not support static library on Windows.
# Local test:
## MKL static:
![image](https://github.com/pytorch/pytorch/assets/8433590/9c6ee5f8-9844-4383-acbd-6b22aff06daa)
MKL backend check:
<img width="724" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/e45e12a5-2dfc-47a1-ad94-32a667bd4799">

## MKL shared, original path:
![image](https://github.com/pytorch/pytorch/assets/8433590/27a822c7-c4ab-4e5f-bbdb-8c4b085140e5)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124925
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-04-25 14:21:15 +00:00
25c0d3f3f0 Fix mypy issues in fake_tensor.py (#124428)
fake_tensor.py had mypy error ignored. That seems less than desirable.

Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees).

Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428
Approved by: https://github.com/malfet
2024-04-25 14:07:53 +00:00
87bec7db4e Refactor all top level usages of record_shapeenv_event to ShapeEnv class (#123735)
This ensures that first argument to record_shapeenv_event is a ShapeEnv
so we can appropriately short circuit when recording is not in progress.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123735
Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD
ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782, #124785
2024-04-25 14:02:48 +00:00
61e05f2fb4 Don't ignore fresh unbacked symbols in AOTAutograd forward analysis (#124785)
This ensures we have correct SymInts when we allocate tangents.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124785
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782
2024-04-25 14:02:48 +00:00
b4597fffce Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them.  Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky.  Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know
what the original names are yet, we only find out later at (3).  But by
the time (3) runs, we've already stuffed some new bindings in
meta["unbacked_bindings"] and we don't know how to update them!  To fix
this, I introduce resolve_unbacked_bindings which post facto applies any
of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
2024-04-25 14:02:42 +00:00
4c44e2b236 Improved unbacked SymInt input support in Inductor (#124739)
This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/

This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:

* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt

Fixes https://github.com/pytorch/pytorch/issues/124652

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394
2024-04-25 13:29:53 +00:00
cyy
1ac402a96c [Distributed] [6/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124701)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124043.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124701
Approved by: https://github.com/ezyang
2024-04-25 11:39:23 +00:00
f6ce94dca5 Revert "[inductor] Remove usage of device_interface from _inductor.runtime (#124592)"
This reverts commit 5d45eb77f1aeb57f13391990215b518a607b3c7e.

Reverted https://github.com/pytorch/pytorch/pull/124592 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124592#issuecomment-2076957668))
2024-04-25 11:28:23 +00:00
58806d6531 [decomp] Remove dead device_hint function (#124849)
The only use of this function is in `_to_copy` but the result is never used,
so this is just dead code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124849
Approved by: https://github.com/lezcano
2024-04-25 11:25:51 +00:00
5f9ea26185 Revert "OSS: Capture triton kernel in ET (#124775)"
This reverts commit c55309e58f88dd37e41e80425fd84a71d4b51548.

Reverted https://github.com/pytorch/pytorch/pull/124775 on behalf of https://github.com/jeanschmidt due to need to revert so I can revert https://github.com/pytorch/pytorch/pull/124592 ([comment](https://github.com/pytorch/pytorch/pull/124775#issuecomment-2076954322))
2024-04-25 11:24:39 +00:00
3890848ec2 Revert "[ROCm] Triton upstream AMD backend integration (#121801)"
This reverts commit 9888d7495ece6b6df3b7334fc7c2a9d869359250.

Reverted https://github.com/pytorch/pytorch/pull/121801 on behalf of https://github.com/jeanschmidt due to need to revert so I can revert https://github.com/pytorch/pytorch/pull/124592 ([comment](https://github.com/pytorch/pytorch/pull/121801#issuecomment-2076951327))
2024-04-25 11:22:19 +00:00
e520233526 Revert "[dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681)"
This reverts commit 0792ceab4b6a61c6c217f65c3fecf51d75e65a9f.

Reverted https://github.com/pytorch/pytorch/pull/124681 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124681#issuecomment-2076937810))
2024-04-25 11:14:02 +00:00
f3af049b88 [DDP][PT2D] Fix the import issue (#124846)
As title

Differential Revision: [D56521582](https://our.internmc.facebook.com/intern/diff/D56521582/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124846
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #124421, #124422, #123424
2024-04-25 11:08:27 +00:00
0ca1ff3dce Revert "Add support for capturing tensors with score_mod (#124444)"
This reverts commit 7c253a777641791247f7fcc19fe5c60f24be32b9.

Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))
2024-04-25 10:56:38 +00:00
c0fd7894cc Revert "Fast standalone symbolize for unwinding (#123966)"
This reverts commit 772ae6da1eb9be1f4238ff993830c56488ecae13.

Reverted https://github.com/pytorch/pytorch/pull/123966 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, check D56522678 ([comment](https://github.com/pytorch/pytorch/pull/123966#issuecomment-2076821043))
2024-04-25 10:04:48 +00:00
2d7f709752 [Inductor] Force the parallel depth as outer loop fusion depth (#123899)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/123801 which brings performance regression of `pyhpc_turbulent_kinetic_energy` after outer loop fusion.

**Root Cause**

- [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209)
  - Taking below 2 kernels as example:
    - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. Parallelization is not feasible due to the inefficient number of elements determined by [`decide_parallel_depth`](aaec97a403/torch/_inductor/codegen/cpp.py (L2145-L2164)). Therefore, the loop code will be generated with the `#pragma omp single` directive.
    - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26] which has enough number of elements to be parallelized.
- [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887)
  - After outer loop fusion, `Kernel0` and `Kernel1` has been fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497), the outer loop size is [200, 200] which does not contain enough number of elements to do parallelization.

In this PR, we propose a fix for `loop_nest` involving `OuterLoopFusedKernel`. The fix entails adding a specific heuristic for `OuterLoopFusedKernel` to determine the parallel depth by combining `outer_loop_fusion_depth` with the internal kernels' parallel depth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123899
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-04-25 09:50:46 +00:00
24ed909934 Revert "[CUDA] Fix 64-bit indexing in vol2col in conv3d (#124650)"
This reverts commit 71d92bace2b9ff6431976cda69c83df668d078f0.

Reverted https://github.com/pytorch/pytorch/pull/124650 on behalf of https://github.com/jeanschmidt due to Reverting to check if it introduced regressions for linux-focal-rocm6.0-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/124650#issuecomment-2076786795))
2024-04-25 09:46:21 +00:00
678662a557 Revert "Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)"
This reverts commit acc4cbea395c25410c26d6fd3c88c072ce24c918.

Reverted https://github.com/pytorch/pytorch/pull/124799 on behalf of https://github.com/jeanschmidt due to checking if this diff introduced regressions on linux-focal-py3.11-clang10 and linux-focal-py3.8-clang10 ([comment](https://github.com/pytorch/pytorch/pull/124799#issuecomment-2076756876))
2024-04-25 09:29:57 +00:00
48a016157d Revert "[benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729)"
This reverts commit c021c9b8e48b8e787b75fd69a3076beffffb8208.

Reverted https://github.com/pytorch/pytorch/pull/119729 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))
2024-04-25 09:26:25 +00:00
6a92b352ee Revert "[cudagraphs] add more info to skip messages (#124700)"
This reverts commit 0ed38c9b227f2099c77f4b34fbbe72afa176ac25.

Reverted https://github.com/pytorch/pytorch/pull/124700 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))
2024-04-25 09:26:25 +00:00
154157416c Revert "[cudagraphs] add cudagraph_skips counter (#124804)"
This reverts commit fdad16b85108209bc021107f312f4b221422a012.

Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))
2024-04-25 09:26:25 +00:00
7a6813b7b3 Revert "[cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)"
This reverts commit 64af899fdfc30c0c075d90bde111cec74ad9b4bb.

Reverted https://github.com/pytorch/pytorch/pull/122510 on behalf of https://github.com/jeanschmidt due to Breaking amd gpu builds ([comment](https://github.com/pytorch/pytorch/pull/122510#issuecomment-2076743868))
2024-04-25 09:22:37 +00:00
9d139eedcf [AOTI] set alignment for aot constant (#124272)
GPU copies the constant blob to aligned memory ([RAII_cudaMalloc](d0211e207c/torch/csrc/inductor/aoti_runtime/model.h (L46)
), [64-alignment](d0211e207c/torch/csrc/inductor/aoti_runtime/model.h (L324))) while CPU doesn't have this copy procedure for constant blob, which may result in sub-optimal performance when we want to directly use the constant blob buffer in the computation (for example when these constant blobs are the weight tensor to the oneDNN primitive).

We set the alignment to the `constant.o` directly so that there's no need to copy the data to an aligned memory for CPU (when using `--rename-section`, the original section name would need to be specified for `--set-section-alignment`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124272
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-25 08:37:44 +00:00
e68d65dae2 [dynamo][cpp-guards] Differentiate dict guards wrt to guarding on key order (#124779)
We guard on key order
1) When a key is a non-constant object
2) When we actually need key order - like .values, .items etc

For dicts/OrderedDicts that do not require key order guarding, we just rely on usual `GuardManger + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())` based design for OrderedDicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779
Approved by: https://github.com/jansel
2024-04-25 08:20:35 +00:00
59a1f1f308 [dynamo][inline inbuilt nn modules] Do not inline for export (#124814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124814
Approved by: https://github.com/jansel
2024-04-25 06:35:31 +00:00
94af62b000 Updated test_graph_grad_scaling to use new OptimizerInfo infrastructure (#123581)
This PR targets the issue mentioned in #123451 , and solves the specific task to update`test_graph_grad_scaling` in `test/test_cuda.py` to use the new OptimizerInfo infrastructure.

`test_graph_grad_scaling` is moved to a new `TestCase` class called `TestCudaOptims` in order to use `instantiate_device_type_tests`. The test content remained the same. `@onlyCUDA` is applied to the new test; the original use of the wrapper function is also changed to a `@parametrize` decorator for better style.

If we think that this migration is successful, we can delete the original test item under `TestCuda`. Currently it is left untouched to avoid any unexpected issues.

Local linter passed.
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Local tests passed.
```
> python .\test\test_cuda.py -k test_graph_grad_scaling
Ran 7 tests in 0.458s
OK (skipped = 3)
```
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123581
Approved by: https://github.com/janeyx99
2024-04-25 06:29:20 +00:00
acc4cbea39 Made FlexAttention rewrite getitem calls to use aten.index in score_mod (#124799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124799
Approved by: https://github.com/drisspg
2024-04-25 06:19:55 +00:00
9a70e7f58c [Nested Tensor]Add unit test that cover the internal use cases (#124880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124880
Approved by: https://github.com/jbschlosser
2024-04-25 05:04:27 +00:00
8b2f8ee5ef [DDP][PT2D] Fix no_compiled_forward flag in the test (#124829)
As title

Differential Revision: [D56508696](https://our.internmc.facebook.com/intern/diff/D56508696/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124829
Approved by: https://github.com/yf225
ghstack dependencies: #124421, #124422, #123424
2024-04-25 04:55:39 +00:00
b21bf5e4e4 [foreach] Use same dtypes when dtypesIfCUDA is None (#124813)
in order to avoid accidentally testing cuda path with fewer dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124813
Approved by: https://github.com/janeyx99
2024-04-25 04:42:24 +00:00
84666389e1 [FX] Update opinfo tests (flattened diff) (#124657)
Summary:
This diff updates opinfo tests to compute more statistics. The results are described in this post:
https://fb.workplace.com/groups/ai.acceleration.team/permalink/825131926110067/

New features:
  - Optionally dump kernels to a directory
  - Optionally disable block pointers
  - Impose a time limit (2 min) on individual tests
  - Report a variety of specific error codes when a fails:
       - MIXED
       - FALLBACK
       - EXPORT_ERROR
       - COMPILE_ERROR
       - MULTIPLE_KERNELS
       - MISSING_KERNELS
       - TIMEOUT
   - Disable setting the RNG seed inside of opinfo, since Dynamo doesn't like this and it caused a lot of tests to fail which otherwise would be able to generate Triton.
   - Check each test's `(op,dtype)` pair against {HuggingFace, TIMM, TorchBench} benchmark logs, to see whether tests are representative of real-world usage.

Test Plan:
`buck2 test @//mode/{dev-nosan,mtia} fbcode//triton_mtia/python/test:` passed locally. This code is also exercised by the CI.

Added a bunch of new unit tests:
 - Dumping kernels to a directory
 - Disabling block pointers
 - Mocking various error conditions in inductor
   - No kernels
   - Multiple kernels
   - ATen fallback
   - Partial ATen fallback (mixed Triton + ATen)
   - `torch.export` raised exception
   - `torch.inductor._compile` raised exception
   - Timeout while running test
   - Test harness raised uncaught exception
   - Check that return code == Success when exceptions were raised
  - Checking whether various (op,dtype) combos are in benchmarks
     - Check that `aten.add.Tensor` IS in the benchmarks
     - Check that a made up op is NOT in them

Differential Revision: D56336160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124657
Approved by: https://github.com/eellison
2024-04-25 04:38:44 +00:00
4e340a7f8b [custom_op] setup_context fills in default values (#124852)
This is to mirror autograd.Function's setup_context behavior.
The PyTorch Dispatcher removes default values for "FC/BC reasons", but I
convinced myself there's no FC/BC problem for the setup_context API.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124852
Approved by: https://github.com/albanD
ghstack dependencies: #124637, #124805, #124806
2024-04-25 04:22:01 +00:00
fdad16b851 [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-25 03:38:09 +00:00
0ed38c9b22 [cudagraphs] add more info to skip messages (#124700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124700
Approved by: https://github.com/eellison
ghstack dependencies: #119729
2024-04-25 03:38:09 +00:00
c021c9b8e4 [benchmark][cudagraph] Explicitly call aten.div with CUDA denominator for cudagraphs (#119729)
aten.div's output device will be its numerator's device. so it is acceptable to do cuda / cpu type divisions. post grad passes operate only on graphs and can't handle runtime graph inputs. so we change user code to move inputs to cuda for cudagraph. this affects any graph that has cpu tensors as graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119729
Approved by: https://github.com/eellison
2024-04-25 03:38:09 +00:00
0b0eea2229 [dtensor] move pad/unpad_tensor to separate utils (#124871)
as titled, 1. pad/unpad is a general util not specific to the Shard
placement, 2. for the propose of the next PR, move these two out of Shard
placement itself, and give additional pad_dim argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124871
Approved by: https://github.com/awgu, https://github.com/wz337
2024-04-25 03:36:16 +00:00
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00
66b0156e0b Ban replacements with unbacked SymInt on both sides (#124316)
Fixes https://github.com/pytorch/pytorch/issues/123854

Important comment:

```
                # Never replace unbacked symbols with other unbacked symbols.
                # This is error prone because you can cause references to
                # unbacked symbols to time travel backwards.  E.g.,
                #
                # u1 = x.item()
                # ... use of u1 ...
                # u2 = y.item()
                # u3 = z.item()
                # torch._check(u1 == u2 + u3)
                #
                # If you replace u1 with u2 + u3, then the use of u1 now
                # references u2 and u3 prior to them actually being bound at
                # runtime.  It's pretty inconvenient to setup control
                # dependencies for substitutions, so ban it entirely.
```

This is kind of risky for the internal MRS workstream, because we added these substitutions upon their request in the first place. Fortunately, we still allow substitutions to backed SymInts and constants, and I believe that is what is actually load bearing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124316
Approved by: https://github.com/ColinPeppler, https://github.com/lezcano
ghstack dependencies: #124310, #124314
2024-04-25 02:08:59 +00:00
5e58227d27 Rebind and refresh unbacked bindings in FakeTensorUpdater (#124314)
Like the previous two PRs, this is doing the rebinding and binding computation, just in FakeTensorUpdater. FakeTensorUpdater modifies FX graph in place so its usage pattern is slightly different, but still pretty short.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124314
Approved by: https://github.com/IvanKobzarev, https://github.com/lezcano
ghstack dependencies: #124310
2024-04-25 02:08:55 +00:00
9692b954c6 FakeTensorProp works with unbacked bindings (#124310)
This is a partial revert of https://github.com/pytorch/pytorch/pull/124059

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding.  This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310
Approved by: https://github.com/lezcano
2024-04-25 02:08:51 +00:00
141888765b Verify types in custom op schemas (#124520)
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.

This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).

Test Plan:
- new tests

Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520
Approved by: https://github.com/albanD
2024-04-25 01:56:58 +00:00
050dd65a87 [onnx.export] Track new nodes added during _run_symbolic_function (#123027)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- This copies the shape and type from the node to the nodes that are produced by the export. However, for 1-to-N exports, which are very common, this doesn't make much sense and can give the graph in broken shape or type information. As far as I can tell, a shape inference pass is used to propagate the correct shape and type for all interemediate (and final) nodes.
- If there is a situation where this is necessary (shape inference turned off and only 1-to-1 ops are exported ??), perhaps this can be conditionally skipped. It does incur a quadratic cost. Another option is to set a global default for the metadata and
use that for all nodes that get created. Again, this meta data may not make sense for all ops and seems dangerous to do.
- Resolves (8) in #121422.

(partial fix of #121422)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123027
Approved by: https://github.com/BowenBao
2024-04-25 01:56:36 +00:00
4f398eed0b [custom_op] register_autograd supports non-tensor kwargonly-args (#124806)
The user does not need to return gradients for these args.

We also change how setup_context works to adapt to kwargonly-args. If
the user's op has no kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, output)`: we require that the
arguments have the same names.

If the user's op has kwonly-args, then their setup_context function must
look like `setup_context(ctx, inputs, keyword_only_inputs, output)`.
We require that the arguments have the same names.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124806
Approved by: https://github.com/albanD, https://github.com/williamwen42
ghstack dependencies: #124637, #124805
2024-04-25 01:51:02 +00:00
31522391a8 [custom_op] Blanket ban kwarg-only Tensors (#124805)
We can lift this if users ask for but I haven't seen an op that someone
would use with this api that uses a kwarg-only Tensor yet

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124805
Approved by: https://github.com/albanD, https://github.com/williamwen42
ghstack dependencies: #124637
2024-04-25 01:51:02 +00:00
2b1c13e3a3 [custom_op] fix schema inference for kwarg-only args (#124637)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124637
Approved by: https://github.com/williamwen42, https://github.com/albanD
2024-04-25 01:51:02 +00:00
c5e567c573 [Torch][Timer] Adding debug info logging interface for expired timers (#123883)
Summary:
Adding function to log additional debug information before killing the expired watchdog timers.

Additional information like stack trace can be added in the debug function using worker process IDs from expired timers.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test

Differential Revision: D56044153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883
Approved by: https://github.com/kurman
2024-04-25 01:15:52 +00:00
43313a506a Dont precompile if we search_autotune_cache but not max autotune is set (#124870)
Differential Revision: [D56534950](https://our.internmc.facebook.com/intern/diff/D56534950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124870
Approved by: https://github.com/xw285cornell
2024-04-25 01:07:21 +00:00
68225072e8 Match insignificant strides for sdpa inputs (#124859)
Fix for https://github.com/pytorch/pytorch/issues/124289.

There was a tensor which had a single, expanded element. inductor generated the strides as all 0, while sdpa expects a dense last dimension `t.stride(-1) == 1`. While these are equivalent, we still hit an error in the kernel. We could make fixes in sdpa, but matching the insignificant strides in inductor also works and I am less aware of the downstream sdpa kernel details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124859
Approved by: https://github.com/drisspg
ghstack dependencies: #124751
2024-04-24 23:44:23 +00:00
36c983a973 [DeviceMesh] Added DeviceMesh.from_group() (#124787)
This PR adds a `DeviceMesh.from_group()` static method to convert an existing process group to a device mesh.

Motivation: We need `DeviceMesh.from_group()` to allow FSDP2 to interoperate with distributed libraries that do not use `DeviceMesh` for all parallelisms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124787
Approved by: https://github.com/wanchaol
ghstack dependencies: #124651, #124741, #124767, #124768, #124780
2024-04-24 23:16:06 +00:00
cb94845b14 Force upsample to be float32 (#121324)
Fixes #121072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324
Approved by: https://github.com/albanD
2024-04-24 23:14:41 +00:00
02ed2992d9 [export] Capture tensor.to() under export. (#123732)
Summary: We use to skip tensor.to() during tracing when the device is the same. This will bring some performance improvement in eager but making graph capture losing the semantics from original model. In this diff, we add an additional condition to skip the fast path when we don't have actual data inside a tensor, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This won't have perf impact on previous eager models while making sure we can capture the _to_copy() node in the graph.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to

Differential Revision: D55969674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123732
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-04-24 23:12:19 +00:00
4f29103749 [ez][CI] Move test_cuda off CI_SERIAL_LIST (#124649)
Tag test cases with large tensor with serial, also tag a few more that failed on a previous iteration of this PR

Move test_cuda and test_cuda_expandable_segments off the serial list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124649
Approved by: https://github.com/ZainRizvi
2024-04-24 22:04:23 +00:00
85b28ffc3a [quant][pt2e] Move batch norm op between eval/train for cuda (#123957)
Summary: Before in `move_exported_model_to_train/eval`, we only
switched the CPU versions of the batch norm op. This commit adds
support for the cuda versions of the op too. Note that this fix
is temporary; we won't have to differentiate between these two
cases once we have batch norm consolidation.

Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn

Reviewers: jerryzh168

Subscribers: jerryzh168, leslie-fang-intel, supriyar

Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
2024-04-24 22:01:50 +00:00
82fe9071c2 [ROCm][CI] fix 5.7 nightly wheel build (#124797)
Fixes broken ROCm 5.7 build caused by #122106.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124797
Approved by: https://github.com/atalman
2024-04-24 21:55:24 +00:00
a89f442f0b add -fclang-abi-compat=17 to HIP_HIPCC_FLAGS (#124862)
C++20 mangling rules were recently added to hip-clang. This flag maintains compatibility since pytorch is at C++17. Otherwise the linker fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124862
Approved by: https://github.com/malfet, https://github.com/pruthvistony
2024-04-24 21:46:50 +00:00
7809b34288 [DTensor][Easy] Update OpSchema __repr__ to show args_schema in format print (#124812)
When printing op_schema with `print(f"{op_schema=}")`:

Before -- can't view into the OpStrategy/TupleStrategy in format print:
```
# A pointwise strategy
op_schema=OpSchema(op=aten.relu.default, args_schema=(<torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e0520>,), kwargs_schema={})
# A pointwise strategy
pointwise_strategy -- op_schema=OpSchema(op=aten.threshold_backward.default, args_schema=(<torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e1540>, <torch.distributed._tensor.op_schema.OpStrategy object at 0x7f4e763e1510>, 0), kwargs_schema={})
# A tuple strategy
op_schema=OpSchema(op=aten._foreach_lerp_.Scalar, args_schema=(<torch.distributed._tensor.op_schema.TupleStrategy object at 0x7f4e763e31f0>, <torch.distributed._tensor.op_schema.TupleStrategy object at 0x7f4e763e3460>, 0.09999999999999998), kwargs_schema={})
```

After -- printing out the OpStrategy/TupleStrategy string:
```
# A pointwise strategy
op_schema=OpSchema(op=aten.relu.default, args_schema=(OpStrategy:[None -> R] @ mesh: (4,)), kwargs_schema={})
# A pointwise strategy
op_schema=OpSchema(op=aten.threshold_backward.default, args_schema=(OpStrategy:[None -> R] @ mesh: (4,), OpStrategy:[None -> R] @ mesh: (4,), 0), kwargs_schema={})
# A tuple strategy
op_schema=OpSchema(op=aten._foreach_lerp_.Scalar, args_schema=(TupleStrategy(OpStrategy:[None -> S(0)] @ mesh: (4,)), TupleStrategy(OpStrategy:[None -> S(0)] @ mesh: (4,)),0.09999999999999998), kwargs_schema={})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124812
Approved by: https://github.com/wanchaol
2024-04-24 21:34:39 +00:00
a248c24694 [ROCm][Inductor] Disable conv cache emptying with hipgraphs (#124791)
When we warmup hipgraphs, we use cudagraph memory pool to allocate a large part of the memory. We don't necessarily execute the kernels on the GPUs. Therefore, we don't want to free up this allocated memory. However, this is conflicting with emptyCache call happening inside findAlgorithm where convolution algorithm benchmarking is happening. For benchmarking, we might use large memory allocations to cache algorithm results. As a fix, we just disable the emptyCache() call during cudagraph warmup.

As per this cuDNN PR which did the same thing for CUDA, we did not have a significant affect on memory footprint. a8ff647e42

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124791
Approved by: https://github.com/eellison, https://github.com/jeffdaily
2024-04-24 21:21:10 +00:00
80ab062103 [MemoryViz] Improve description of blocks with missing frames (#124784)
Summary:
It is common for blocks to be missing frames and there are many users asking why.

Let's improve this output message to cover common reasons:

1) block was allocated before _record_memory_history was enabled
2) context or stacks passed to _record_memory_history does not include this block
3) backward events allocated with C++ stack and will not show if stacks = python

Test Plan:
CI and ran it locally:
![image](https://github.com/pytorch/pytorch/assets/17602366/60a03a22-0e3e-43d8-9ee7-b14358096fc7)

Differential Revision: D56490921

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124784
Approved by: https://github.com/zdevito
2024-04-24 21:16:31 +00:00
8885638f95 [quant][pt2e] Propagate get_attr meta through known ops only (#124415)
Summary: Avoid situation where the graph traversal finds a matmul node with a `get_attr` as its `args[0]`, and incorrectly propagate the `get_attr`'s meta to everything downstream.

Test Plan: CI

Differential Revision: D56219120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124415
Approved by: https://github.com/jerryzh168
2024-04-24 20:55:56 +00:00
355dc34f86 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks.

Differential Revision: [D56443358](https://our.internmc.facebook.com/intern/diff/D56443358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-24 20:51:20 +00:00
381653de63 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
408aa0182c Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-04-24 20:51:17 +00:00
a22847a9cb We should not be in kernel invocation before we restore fake mode (#124762)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124762
Approved by: https://github.com/eellison
ghstack dependencies: #124760
2024-04-24 20:32:59 +00:00
0d58aeb73a Handle size/etc accessors in FakeTensor, support accessing symbolic types from toInt/etc in IValue (#124760)
Fixes https://github.com/pytorch/pytorch/issues/122772

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124760
Approved by: https://github.com/albanD, https://github.com/eellison
2024-04-24 20:32:59 +00:00
9bd6e93a04 [inductor] Add option to create parent directory for write_atomic (#124646)
In #124640 I see the error

```
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load
    compiled_graph = FxGraphCache._lookup_graph(key, example_inputs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph
    write_atomic(artifact_path, graph.source_code)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic
    with tmp_path.open(write_mode) as f:
  File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp'
```

Which is fixed by creating the parent directory first. Since this is what you
want to do in most cases, I add an argument to `write_atomic` to do so itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646
Approved by: https://github.com/lezcano
2024-04-24 20:12:23 +00:00
adbf62cd0a Fix layer norm in static runtime when input is non-contiguous (#124789)
Test: The added unit test fails before this fix. But it passes now after the fix.

The fix is coming from @swolchok in D56087067.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124789
Approved by: https://github.com/davidberard98
2024-04-24 19:49:36 +00:00
71d92bace2 [CUDA] Fix 64-bit indexing in vol2col in conv3d (#124650)
Similar to #118005, fixes sometimes silent IMAs that occur

CC @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124650
Approved by: https://github.com/soulitzer
2024-04-24 19:47:18 +00:00
8fe0b8b6a8 No CPP or xdist process level reruns (#124798)
xdist doesn't play well with current process level rerun scheme
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124798
Approved by: https://github.com/huydhn
2024-04-24 19:44:51 +00:00
c55309e58f OSS: Capture triton kernel in ET (#124775)
This DIFF is to capture triton kernels in execution trace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124775
Approved by: https://github.com/briancoutinho
2024-04-24 19:39:37 +00:00
872eeb0d7d Refresh OpOverloadPacket if a new OpOverload gets added (#124654)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124654
Approved by: https://github.com/albanD
2024-04-24 19:30:52 +00:00
7ad6dc2cf3 [Profiler][PrivateUse1] Profiler support PrivateUse1 key (#124818)
Summary:
1.Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 user.
2.Add PrivateUse1 key to ActivityType.
3. Support PrivateUse1 key in function deviceTypeFromActivity and _supported_activities.
4. Fix some bugs when processing profiler results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124818
Approved by: https://github.com/aaronenyeshi
2024-04-24 18:52:08 +00:00
f07b6227e6 Initial add of torch.distributed.pipelining (#124776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124776
Approved by: https://github.com/wconstab
2024-04-24 18:51:20 +00:00
40cf38fd15 [BE]: Apply ruff rule FURB192 (#124742)
Apply RUFF rule FURB192 to remove unnecessary sorts and replace them with min / max.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124742
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-24 18:44:08 +00:00
48312a7fc3 [DeviceMesh] Removed unneeded .to(cpu) (#124768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124768
Approved by: https://github.com/wz337
ghstack dependencies: #124651, #124741, #124767
2024-04-24 18:07:20 +00:00
927ae80afa Release 2.3 compatibility matrix (#124861)
Update release compatibility matrix with latest release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124861
Approved by: https://github.com/svekars, https://github.com/seemethere, https://github.com/malfet
2024-04-24 18:05:14 +00:00
1db7d64af2 [DeviceMesh] Initialized mesh tensor with CPU context (#124767)
This PR makes sure to construct the `DeviceMesh`'s `mesh` tensor on CPU device in `init_device_mesh()`. This means that we can call `init_device_mesh()` under meta-device context and still construct the correct `mesh` tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124767
Approved by: https://github.com/wz337
ghstack dependencies: #124651, #124741
2024-04-24 18:04:06 +00:00
674e15ae07 Back out "Switch to predispatch" (#124860)
Summary:
Original commit changeset: 1f155b3a0bfc

Original Phabricator Diff: D56273267

Test Plan: CI

Differential Revision: D56526505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124860
Approved by: https://github.com/angelayi
2024-04-24 17:28:33 +00:00
9888d7495e [ROCm] Triton upstream AMD backend integration (#121801)
Update ROCm-triton to use the AMD backend from https://github.com/openai/triton

Note: `test__int_mm` can be enabled after https://github.com/pytorch/pytorch/pull/122431 is landed

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121801
Approved by: https://github.com/nmacchioni, https://github.com/malfet
2024-04-24 17:28:12 +00:00
ed120b08c4 Add common used score_mod functions for templated attention (#124670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124670
Approved by: https://github.com/Chillee
2024-04-24 17:04:36 +00:00
977105466f Remove activation checkpointing tag to get correct FQNs (#124698)
Fixes #124546

When setting `use_orig_params = False` and using activation checkpointing, the FQN mapping as retrieved by the `_get_fqns` function is incorrect because the prefix that is added to the name of each activation checkpointed module, `_checkpoint_wrapped_module`, can still be present. I think this is an edge case with the `_get_fqns` function that was not addressed by this previous commit #118119.

Without the change, the list of object names for an activation checkpointed module with FSDP (and `use_orig_params=False`) can be something like:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_checkpoint_wrapped_module', '_flat_param']
```
Which will incorrectly return just one FQN, `{'model.transformer.blocks.0._flat_param'}`, when all the FQNs of the parameters of the transformer block should be returned.

With the change, the list of object names will now have `_checkpoint_wrapped_module` removed:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_flat_param']
```
And the FQNs are correctly retrieved and returned in `_get_fqns` when [this condition](ea61c9cb29/torch/distributed/checkpoint/state_dict.py (L168)) is satisfied. The correct FQNs are:
```
{'model.transformer.blocks.0.attn.Wqkv.bias', 'model.transformer.blocks.0.ffn.up_proj.bias',
'model.transformer.blocks.0.attn.out_proj.weight', 'model.transformer.blocks.0.norm_2.weight',
'model.transformer.blocks.0.ffn.down_proj.weight', 'model.transformer.blocks.0.attn.Wqkv.weight',
'model.transformer.blocks.0.norm_2.bias', 'model.transformer.blocks.0.ffn.up_proj.weight',
'model.transformer.blocks.0.ffn.down_proj.bias', 'model.transformer.blocks.0.norm_1.bias',
'model.transformer.blocks.0.norm_1.weight', 'model.transformer.blocks.0.attn.out_proj.bias'}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124698
Approved by: https://github.com/Skylion007
2024-04-24 16:47:50 +00:00
bf834d388b Mark test_xavier_uniform as slow (#124801)
takes 17+ minutes, sometimes 30+ min
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124801
Approved by: https://github.com/huydhn
2024-04-24 15:48:04 +00:00
e739a2d59e Revert "[quant][pt2e] Move batch norm op between eval/train for cuda (#123957)"
This reverts commit 4efb28c90025ea3d979b720942cd97a274fac6da.

Reverted https://github.com/pytorch/pytorch/pull/123957 on behalf of https://github.com/jeanschmidt due to reverting to check if it will fix rocm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/123957#issuecomment-2075158146))
2024-04-24 15:02:11 +00:00
92295fbacd Revert "Verify types in custom op schemas (#124520)"
This reverts commit 5b98d43488bed0836b4da5996a50bafd0dd2c11c.

Reverted https://github.com/pytorch/pytorch/pull/124520 on behalf of https://github.com/zou3519 due to broke static runtime tests ([comment](https://github.com/pytorch/pytorch/pull/124520#issuecomment-2075111935))
2024-04-24 14:41:26 +00:00
7d94f52a8a [Inductor Cutlass backend] clean up CUTLASSGemmTemplate.add_cutlass_gemm_choices (#124575)
Clean up CUTLASSGemmTemplate.add_cutlass_gemm_choices, removing code that became unneccessary by removing EVT-based epilogue fusion.

Test Plan:
Already covered by test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124575
Approved by: https://github.com/jansel
ghstack dependencies: #121497, #123930, #123932, #121734, #124107, #124574
2024-04-24 14:00:12 +00:00
49f0d127fb Fix a bug in retrieving approximate bsr_dense_addmm kernel meta data (#124371)
Fixes #124333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124371
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-04-24 13:59:18 +00:00
a47f4253ab [Inductor Cutlass backend] Set INDUCTOR_TEST_DISABLE_FRESH_CACHE in test setup (#124574)
The diff https://github.com/pytorch/pytorch/pull/122661 introduces a new automatic cache refresh mechanism during all inductor-derived test cases.

But this refresh mechanism seems not to work properly across process boundaries, specifically when using  autotune_in_subproc, which many tests in test_cutlass_backend.py rely on.

Solution: Set the env var INDUCTOR_TEST_DISABLE_FRESH_CACHE=1
early during test setup within test_cutlass_backend.py

Test Plan:
This is a change to unit tests only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124574
Approved by: https://github.com/aakhundov
ghstack dependencies: #121497, #123930, #123932, #121734, #124107
2024-04-24 13:58:29 +00:00
e76b5e3cc8 [Inductor Cutlass backend] Disable epilogue fusions (#124107)
This diff disables Cutlass backend EVT epilogue fusions. It does not yet contain the removal of most of the underlying implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124107
Approved by: https://github.com/jansel
ghstack dependencies: #121497, #123930, #123932, #121734
2024-04-24 13:56:44 +00:00
537aebc99f [Inductor cutlass backend] Add bmm support (#121734)
Add support for bmm ( batch matrix multiply ) op through the Cutlass backend.

Test Plan:
 * CI
 * Added test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121734
Approved by: https://github.com/eellison
ghstack dependencies: #121497, #123930, #123932
2024-04-24 13:54:54 +00:00
fb69eef1b4 Add int testing for foreach_copy on cuda (#124703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124703
Approved by: https://github.com/crcrpar, https://github.com/albanD
2024-04-24 13:11:56 +00:00
89ca0cb7a0 [FSDP2] Added test to show rank 0 CPU full state dict flow (#124741)
This PR adds a unit test to show how we can convert FSDP2 GPU sharded state dicts to a CPU full state dict on rank 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124741
Approved by: https://github.com/wanchaol, https://github.com/wz337
ghstack dependencies: #124651
2024-04-24 13:02:19 +00:00
e0e2d897ed Handle Tensor returns in PropagateUnbackedSymInts (#124297)
This subsumes https://github.com/pytorch/pytorch/pull/124069

In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities.

To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside an, e.g., tensor size.

We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297
Approved by: https://github.com/lezcano
ghstack dependencies: #124290
2024-04-24 12:18:33 +00:00
b04dca1502 Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290)
The important comment:

```
        # Whenever we allocate a fresh unbacked Symbol, we add it to this
        # pending list.  Unbacked symbol allocation can occur at unpredictable
        # points during meta tensor propagation, but at some point, the we
        # have to know what the binding site for an unbacked symbol is, and
        # this is computed when we actually place the node in the graph.  The
        # important thing is that we always actually handle every unaccounted
        # for unbacked symbol, so this list helps us keep track of them and
        # then make sure they are all accounted for.
        #
        # We could potentially give rise to errors earlier by lexically
        # scoping when we do propagation, and only allowing unbacked symbols
        # to be allocated at this point in time.  However this is inconvenient
        # to do in Dynamo, because fake tensor propagation is far from when we
        # analyze binding sites (set_example_value), so we do it in a more
        # mutatey way.
        #
        # NB: fresh unbacked symbols NEVER get substitutions applied to them,
        # they are binding sites!
```

The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment:

```
    After having run fake tensor propagation and producing example_value
    result, traverse example_value looking for freshly bound unbacked
    symbols and record their paths for later.  It is an error if
    we have allocated an unbacked SymInt but it cannot be found in
    example_value.  (NB: this means if you have a multi-output
    function, you must call this on the tuple of tensor output, you
    cannot wait!)
```

For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it.

I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290
Approved by: https://github.com/lezcano
2024-04-24 09:11:34 +00:00
0848051844 Migrate linux-test Job yo ARC (#124386)
Migrate linux-test Job yo ARC

* Separated `_linux-test-label.yml` workflow to use the `label`;
* Separated `_linux-test-rg.yml` workflow to use the `group`;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124386
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-04-24 06:48:19 +00:00
290bfbe01f [DDP][PT2D] Lazy Initialization of DDP Module for Replicate API (#123424)
In order to make replicate work with Meta tensor, we need to do lazy Initialization for the replicate API. This PR impelements the lazy initialization and ensures that replicate still work with the new DDP compilation.

Differential Revision: [D55787340](https://our.internmc.facebook.com/intern/diff/D55787340/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123424
Approved by: https://github.com/yf225
ghstack dependencies: #124421, #124422
2024-04-24 06:30:19 +00:00
81740fd1f6 [DCP] minor readability fix: make param name consistent with overriden function (#124770)
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the name used in prepare_global_plan in StorageWriter. Pasting func signature for easy reference -

    abc.abstractmethod
    def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:

Differential Revision: D56480396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
2024-04-24 05:31:26 +00:00
34f468e66f remove the redundent '* 1000' to timestamp (#124374)
activity->timestamp() already in nanosecond granularity,  no need to multiply by 1000.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124374
Approved by: https://github.com/aaronenyeshi
2024-04-24 04:57:44 +00:00
0da94f3a08 [device_mesh] add a private init backend option (#124780)
This PR adds a private init backend option, to tackle the issues sub
mesh creation:

in device mesh slicing we don't want to create process groups again,
so explicitly turn the group creation off it's useful

Also I think there might be more submesh creation functionality so
having this flag would ensure that there's no new group created

Differential Revision: [D56497780](https://our.internmc.facebook.com/intern/diff/D56497780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124780
Approved by: https://github.com/awgu
2024-04-24 04:31:58 +00:00
b91f83f181 [cudagraph] add config for cudagraph managed input mutation support (#124754)
Summary: [#123231](https://github.com/pytorch/pytorch/pull/123231) adds cudagraph supports for more types of functions (i.e., cudagraph managed input mutation). These newly supported functions may have mutated static inputs, leading to assertion errors in some workload which skip cudagraph previously. This diff adds a config to opt in the new feature.

Test Plan: ci

Differential Revision: D56481353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124754
Approved by: https://github.com/eellison
2024-04-24 04:23:53 +00:00
bee924d173 Enable test config selection when doing workflow dispatch (#124795)
Fixes https://github.com/pytorch/test-infra/issues/4468

This is done by updating the filter config script to accept a list of test configs coming from workflow dispatch.  For example, having `inductor_huggingface_perf,inductor_timm_perf,inductor_torchbench_perf` will benchmark all 3 datasets, while having `inductor_torchbench_perf` will only run TorchBench.  This is exposed via a new string workflow dispatch parameters called `benchmark_configs`.

Note that GH limits the maximum number of workflow dispatch parameters to 10, so I need to consolidate `training` and `inference` into `training_and_inference` to squeeze the new parameter into the list.

### Testing

Run the script manually and confirm that the filtered list of test config is correct.

Also manually dispatch the job with the new parameter https://github.com/pytorch/pytorch/actions/runs/8808159905 and only the selected `inductor_huggingface_perf` is kept https://github.com/pytorch/pytorch/actions/runs/8808159905/job/24176683708#step:11:128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124795
Approved by: https://github.com/clee2000
2024-04-24 03:13:38 +00:00
9dded148d0 Fix test_extension_backend on non-AVX systems (#117272)
The test checks for a substring "loadu" in generated code. On AVX systems that line is:
> auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i0))
however on non-AVX systems it is
> auto tmp0 = in_ptr0[static_cast<long>(i0)];

the difference depends on `codecache.valid_vec_isa_list()` being non-empty. See torch/_inductor/codegen/cpp.py:2639

Modify the test to account for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117272
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-24 02:55:12 +00:00
2e7b8ff116 [ROCm] Fix Int_mm() Integration with hipblasLT (#122431)
The PR

- fixes int_mm() /int8_gemm() integration with hipblasLT backend (require ROCm 6.0).
- enables/fixes the following tests on Rocm
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_16_n_32_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_16_use_transpose_a_True_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_False_use_transpose_b_True_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_False_cuda
    - test__int_mm_k_32_n_32_use_transpose_a_True_use_transpose_b_True_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122431
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/atalman
2024-04-24 02:29:33 +00:00
f0f7452e31 Do not propogate (#124769)
Fix the propogate typos.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124769
Approved by: https://github.com/Skylion007
2024-04-24 02:18:18 +00:00
952a00eda7 torchelastic: change monitor_interval default to 0.1 (#124692)
This reduces the default monitor_interval for torchelastic to 0.1s as testing shows negligble load for common use cases. Even at the extremes, 100k processes is only 45.4% cpu util of a single core.

Torchelastic monitor_interval only monitors the processes on a single worker so under typical loads even for huge jobs we expect ~8 subprocesses per machine with one per GPU.

As an external datapoint, Python's wait polls every 50usec-50ms (https://github.com/python/cpython/blob/main/Lib/subprocess.py#L2035).

## Motivation

This setting is used to control how frequently we poll for failed processes in elastic.

* For some jobs of note we run elastic 3 times per try so with the default timeout of 5 seconds we should save ~15 seconds per retry.
* @kiukchung's use case: Apparently this is annoying in notebooks etc since it adds delay to shutdown when testing things

## Results

This is measured in cores (100% is a single core under full load).

| monitor_interval (s) | nproc-per-node | CPU util (highest observed) |
| -------------------- | -------------- | --------------------------- |
| 1.0                  | 10             | 0.2%                        |
| 0.1                  | 1              | 0.4%                        |
| 0.1                  | 10             | 0.4%                        |
| 0.01                 | 10             | 0.9%                        |
| 0.001                | 10             | 4.0%                        |
| 0.1                  | 100            | 0.5%                        |
| 0.1                  | 1000           | 2.2%                        |
| 0.1                  | 10000          | 15.7%                       |
| 0.1                  | 100000         | 45.4%                       |

## Methodology

```sh
# run command
$ LOGLEVEL=INFO torchrun --nnodes 1 --nproc-per-node 10 --monitor-interval 0.1 ~/wait.py

# wait a few seconds for all processes to start and reach steady state and then run, wait ~30s or 3 prints and take the highest
$ top -b -d 10 -c | rg 'torchrun.*wait
```

wait.py

```py
import time

time.sleep(10*60)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124692
Approved by: https://github.com/kiukchung, https://github.com/kurman
2024-04-24 01:44:41 +00:00
03fa2421dc Get ARC jobs to run on both classic and ARC infra (#124753)
ARC jobs are too unstable right now.

We're going to mitigate this by:

- Reverting ARC jobs to run on the classic infra (https://github.com/pytorch/pytorch/pull/124748)
- Spin up new jobs in parallel to run on the new infra. (this PR)
- Mark these ARC jobs as unstable (will be done before merging this PR)

More details in https://github.com/pytorch/ci-infra/issues/149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124753
Approved by: https://github.com/zxiiro, https://github.com/seemethere
2024-04-24 01:34:50 +00:00
2716e77cf7 [FSDP1][2D] Fix FSDP1 2D state_dict to use run_check=False (#123802)
`from_local` with replicate placement would run mesh_broadcast if `run_check=True`, by default `from_local` have `run_check=True`, but for FSDP state_dict case we are for sure that these are replicated on dp dimension (FSDP + TP) already, so we don't need to check/force check it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123802
Approved by: https://github.com/wanchaol
2024-04-24 01:25:11 +00:00
57a12d9d0f Add Half support to torch.sparse.addmm for CPU (#124694)
This PR is to add Half support to torch.sparse.addmm for CPU. It is a requested feature in model DCRNN for Half data type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124694
Approved by: https://github.com/pearu
2024-04-24 01:24:01 +00:00
1ab0b3c9f8 [ROCm] avoid heap buffer overflow in hiprtc failure logs (#121865)
hiprtc doesn't seem to include the null byte automatically in the failure logs, resulting in heap buffer overflow. Initializing the log string with the null byte avoids the problem.

Found by rocm address sanitizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121865
Approved by: https://github.com/malfet
2024-04-24 01:09:08 +00:00
4efb28c900 [quant][pt2e] Move batch norm op between eval/train for cuda (#123957)
Summary: Before in `move_exported_model_to_train/eval`, we only
switched the CPU versions of the batch norm op. This commit adds
support for the cuda versions of the op too. Note that this fix
is temporary; we won't have to differentiate between these two
cases once we have batch norm consolidation.

Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn

Reviewers: jerryzh168

Subscribers: jerryzh168, leslie-fang-intel, supriyar

Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
2024-04-24 01:02:59 +00:00
eqy
64af899fdf [cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)
#113713
currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs

Will also collect benchmark data,

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-24 01:00:34 +00:00
31ca27af62 Add the quant lift up pass in convert phase (#122777)
**Summary**

Lift up the quant node before view like nodes. It can benefit performance of Attention like block. For example, we have the pattern as:

```
          DQ
    DQ       LINEAR
    LINEAR   VIEW
    VIEW     PERMUTE
    PERMUTE  TRANSPOSE
    Q        Q
    DQ       DQ
       Matmul
        DIV
        ADD
      SOFTMAX
```

We want to lift up the the quant nodes from `matmul` before view like nodes as the output of Linear node.

```
          DQ
    DQ       LINEAR
    LINEAR   Q
    Q        VIEW
    VIEW     PERMUTE
    PERMUTE  TRANSPOSE
    DQ       DQ
       Matmul
        DIV
        ADD
      SOFTMAX
```

It produces a `DQ->LINEAR->Q` pattern which can be fused by backend.

**Test Plan**

```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122777
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #122776
2024-04-24 00:57:59 +00:00
c933af2709 Switch to predispatch (#123573)
This PR switches export IR from aot-dispatch to pre-dispatch IR.

**What is pre-dispatch IR and why should you care?**

Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run.

In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result:
- You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph.
- You can write sound graph transformations more easily as the IR is functional.
- Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR.

If you want to get the core aten IR out of `torch.export`, you will need to:
```
ep = torch.export.export(M(), inputs)
ep_for_core_aten = ep.run_decompositions()
```

Differential Revision: [D56273267](https://our.internmc.facebook.com/intern/diff/D56273267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123573
Approved by: https://github.com/gmagogsfm
2024-04-24 00:51:09 +00:00
3145522427 [Profiler] Update third_party/kineto submodule hash (#124737)
Summary:
Include improvements such as:
- AMD: roctracer crash fix and roctracer external correlations
- NCCL metadata: process group id to process group name
- Complete nanosecond transition for Kineto
- Remove PrivateUse1Type function causing gpu track to be above cpu tracks
- Use relative time and fix gpu user annotation causing events to overlap

Test Plan: CI and Github CI (full suite)

Reviewed By: sraikund16

Differential Revision: D56475055

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124737
Approved by: https://github.com/davidberard98, https://github.com/malfet
2024-04-24 00:30:17 +00:00
e8f9f37b03 [FSDP2] Added test to show rank 0 broadcast meta-device flow (#124651)
This PR includes two things:
1. Changes to support `load_state_dict(assign=True)`
    - These changes are not ideal, but until we have `DTensor` padding the local tensor and general `swap_tensors` adoption, we may need to make do.
2. Example of how to convert a full state dict on rank 0 to sharded state dict on all ranks via broadcast
    - ~~To-do: check for `recordStream` from the funcol broadcast; if being called, remediate either via `async_op=False` c10d broadcast or use `TORCH_NCCL_AVOID_RECORD_STREAMS=1`~~ switched to using c10d `async_op=False` broadcast
    - To-do: check for broadcast latency since not using any coalescing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124651
Approved by: https://github.com/wanchaol
2024-04-24 00:18:23 +00:00
a21327e0b0 [ROCm] update hipDataType support and hipify mappings (#120751)
The hipDataType support and mappings are now up to date as of ROCm 5.7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120751
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-04-23 23:21:56 +00:00
1c4ad87396 [TorchElastic] Option to enable TCPStore libuv backed (#124684)
Summary:
Libuv backed isn't enabled in PTD by default now. Add an option to enable libuv backed to improve scaling of the rendezvous process.
Tries not to make assumption on the default libuv settings in TCPStore since it may change in the next release.

Test Plan: CI

Differential Revision: D56435815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124684
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2024-04-23 23:12:17 +00:00
3999b72d46 Dont error in check consistency if old is None (#124751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124751
Approved by: https://github.com/ezyang
2024-04-23 22:26:52 +00:00
98ffdf930c Revert ARC jobs to run on classic infra again (#124748)
ARC jobs are too unstable right now.

We're going to mitigate this by:
1. Reverting ARC jobs to run on the classic infra (this PR)
2. Spin up new jobs in parallel, marked as unstable, to run on the new infra. (coming soon)

More details in https://github.com/pytorch/ci-infra/issues/149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124748
Approved by: https://github.com/seemethere, https://github.com/zxiiro, https://github.com/malfet, https://github.com/jeanschmidt
2024-04-23 22:24:31 +00:00
cc268a710d Revert "AOTAutograd: gate view-replay behind config, not the default (#124488)"
This reverts commit 47330ca13321a42d4f1e75f091e17183227ae073.

Reverted https://github.com/pytorch/pytorch/pull/124488 on behalf of https://github.com/seemethere due to submodule update causes xla to start failing see job on branch: https://github.com/pytorch/pytorch/actions/runs/8789091145/job/24124569508, Dr. CI incorrectly marked this as flaky and allowed the merge ([comment](https://github.com/pytorch/pytorch/pull/124488#issuecomment-2073568651))
2024-04-23 22:21:50 +00:00
4ceb44c40d Add torch.library.opcheck (#124496)
This PR:
- exposes torch.testing._internal.optests.opcheck as
  torch.library.opcheck
- Adds support for CustomOpDef (aka functions decorated with
  torch.library.custom_op) to opcheck.

Test Plan:
- Updated tests
- We validated opcheck's design internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496
Approved by: https://github.com/williamwen42
2024-04-23 21:48:00 +00:00
763dc26e59 [Dynamo] Add dynamo support to torch.func.linearize (#123118)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123118
Approved by: https://github.com/zou3519
2024-04-23 21:31:49 +00:00
8cf54929e3 compiletime->compile_time (#124579)
Summary: title.

Test Plan: run strobelight profiler.

Reviewed By: oulgen

Differential Revision: D56395415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124579
Approved by: https://github.com/oulgen
2024-04-23 20:50:53 +00:00
d40774f4ed [export] Fix up nn_module_stack for nodes occured around tracepoint ops. (#124457)
Summary: as title.

Test Plan:
hg checkout D55901896
buck run mode/opt torchrec/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_ebc

Differential Revision: D56340319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124457
Approved by: https://github.com/tugsbayasgalan
2024-04-23 20:26:44 +00:00
e94c846cf7 [ez][TD] Unique td_exclusions file name (#124301)
* Fix after #124082

I keep forgetting that these files overwrite each other

Unrelated but TIL if you want to show the pr/issue title when you link it, it should be in a list
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124301
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-04-23 20:25:27 +00:00
57e92162eb [inductor] Keep inductor cache for tests when TORCH_COMPILE_DEBUG is specified (#124755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124755
Approved by: https://github.com/masnesral
2024-04-23 20:22:55 +00:00
5532c7949f Fix import error in update_failures.py (#124695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124695
Approved by: https://github.com/zou3519
2024-04-23 20:09:49 +00:00
e112792a69 [export] refactor _AddRuntimeAssertionsForInlineConstraintsPass (#124503)
Summary:
The current _AddRuntimeAssertionsForInlineConstraintsPass has 2 known issues caused by its use of torch.fx.Interpreter:
1. SymInt-related ops (e.g. item()) are executed, causing new Unbacked SymInts to appear in the graph during the pass.
2. The graph is reconstructed, and node names/indices can be different from before, causing mismatches with `module_call_graph`, and leading to issues during unflattening.

This refactors the pass to use PassBase instead of _ExportPassBaseDeprecatedDoNotUse, only constructing new nodes for assertions.

Test Plan: This pass is called on all strict-mode export calls with range_constraints, test that behavior remains unchanged.

Differential Revision: D56360137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124503
Approved by: https://github.com/zhxchen17
2024-04-23 20:07:49 +00:00
35a448f3cb Record structured log for overall AOTAutograd backwards compilation (#124648)
It's sort of similar to CompilationMetrics but also not quite the same, quite open to bikeshedding.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124648
Approved by: https://github.com/bdhirsh
ghstack dependencies: #124626
2024-04-23 19:51:14 +00:00
abdd569e41 [easy][test_profiler.py] if tqdm is not available, pass instead of None (#124729)
Change the try exception to pass when it cannot import tqdm.

To address comment: https://github.com/pytorch/pytorch/pull/124409#discussion_r1576327365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124729
Approved by: https://github.com/malfet, https://github.com/shengfukevin
2024-04-23 18:39:39 +00:00
1d3a13d3d1 Conform torch.mps to device module interface (#124676)
Right now `torch.fork_rng()` doesn't support MPS. MPS' device module functions don't line up with the others'. There is a step of `fork_rng` to call `device_count()`:

302d7e9a6e/torch/random.py (L146)

It is pretty simple to know the MPS device count, based on whether it is built and available.

Also:

302d7e9a6e/torch/random.py (L168)

302d7e9a6e/torch/random.py (L175)

`get_rng_state` and `set_rng_state` are expected to be able to accept a `device` parameter.

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124676
Approved by: https://github.com/ezyang
2024-04-23 18:38:48 +00:00
4e66aaa010 update kineto submodel commit id to include new pg naming (#124332)
Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto PR #906

Test Plan: N/A

Differential Revision: D56273619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124332
Approved by: https://github.com/aaronenyeshi
2024-04-23 17:58:10 +00:00
7c253a7776 Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 17:54:08 +00:00
0792ceab4b [dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681
Approved by: https://github.com/masnesral
ghstack dependencies: #124592
2024-04-23 17:51:25 +00:00
5d45eb77f1 [inductor] Remove usage of device_interface from _inductor.runtime (#124592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592
Approved by: https://github.com/masnesral
2024-04-23 17:51:25 +00:00
25a2d18dd9 [Profiler] iterate frontend function events for profiler post processing (#124596)
The `function_events` in `_parse_kineto_results` is used to contain all function events from the result. It contains 2 kinds of events. One is frontend function events whose correlation id is 0, for example, `aten::add`, `aten::mul`. They are on the top level of the profile results. The other is the backend events, which are associated with the frontend events and its correlation id is > 0, for example, `at::native::vectorized_elementwise_kernel`, it should be the backend event of a frontend element-wise op. They have the device execution duration for the related frontend op.

In the following post processing code, the **frontend function events** should be iterated to find its correlated backend events in `device_corr_map`, instead of iterating all function events, because `device_corr_map` is designed as a dict, whose key is the id of the frontend function event.
3af12447f8/torch/autograd/profiler.py (L543-L560)

3af12447f8/torch/autograd/profiler.py (L537-L540)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124596
Approved by: https://github.com/aaronenyeshi
2024-04-23 17:40:32 +00:00
05db64024c [DDP][PT2D] Correctly calculate the numel with symint in DDP fusion (#124422)
As title

Differential Revision: [D56315533](https://our.internmc.facebook.com/intern/diff/D56315533/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124422
Approved by: https://github.com/yf225
ghstack dependencies: #124421
2024-04-23 17:06:18 +00:00
47330ca133 AOTAutograd: gate view-replay behind config, not the default (#124488)
Fixes https://github.com/pytorch/pytorch/issues/124499 (I also changed the warn to an info to avoid noise)

That'll take some investigation, but rather than reverting I'm gating the view-replay behind a config that I default to False. To get the behavior back for XLA, can you have `import torch_xla` set this config?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124488
Approved by: https://github.com/ezyang, https://github.com/Microve
2024-04-23 16:15:50 +00:00
b2fd224f27 [AOTI] Add more ABI-compatiblity unit test (#123900)
Summary: Follow https://github.com/pytorch/pytorch/pull/123848, and test more c10 util functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123900
Approved by: https://github.com/chenyang78
2024-04-23 16:06:40 +00:00
e558008a05 [PyTorch] Add test that canEnableStaticRuntime rejects prim::CallMethod (#120853)
Rejecting prim::CallMethod is called out in a comment in impl.cpp, but doesn't seem to be tested. Now it is.

Differential Revision: [D54338261](https://our.internmc.facebook.com/intern/diff/D54338261/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120853
Approved by: https://github.com/houseroad
2024-04-23 15:56:42 +00:00
fb6d052e9c Specify the exact table we upload metrics to (#124321)
Part of https://github.com/pytorch/ci-infra/issues/113

Since this table is only located in one AWS account, but the ARC account
also needs to access it, explicitly specify the account name for the
table
2024-04-23 10:55:52 -05:00
772ae6da1e Fast standalone symbolize for unwinding (#123966)
We've had issues using addr2line. On certain versions of
CentOS it is on a version that has a performance regression making it very slow,
and even normallly it is not that fast, taking several seconds even when parallelized
for a typical memory trace dump.

Folly Symbolize or LLVMSymbolize are fast but it requires PyTorch take a dependency on those libraries to do this, and given the number of environments we run stuff in, we end up hitting cases where we fallback to slow addr2line behavior.

This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.

I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensure it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang
2024-04-23 15:27:18 +00:00
cf98cab1b6 [export] Forward fix XNNPackQuantizer.module_type_config to detect str nn_module_stack (#123662)
https://github.com/pytorch/pytorch/pull/123308 previously changed the nn_module_stack format (module type -> module str). This modifies XNNPackQuantizer's module_type_config to detect class module strs instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123662
Approved by: https://github.com/williamwen42
2024-04-23 15:21:37 +00:00
7ecbbc40c3 [HOP][inductor] Add higher order associative scan operator (#119430)
Currently only supports single tensor scans, e.g. `cumsum`, `cumprod`, `logcumsumexp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119430
Approved by: https://github.com/Chillee
2024-04-23 14:40:13 +00:00
64491c0811 Restore CompileContext as well in backwards (#124626)
This should fix many of the unknown compile id problems currently
afflicting tlparse backwards analysis.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124626
Approved by: https://github.com/bdhirsh
2024-04-23 14:39:52 +00:00
4f3e1f1c93 Revert "Add support for capturing tensors with score_mod (#124444)"
This reverts commit e0c5113dec79608941db69ae091dfe8893f9a14f.

Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/malfet due to This is weird, but somehow profile test started to timeout after this PR, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=noGPU_AVX512 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2072506731))
2024-04-23 14:39:37 +00:00
5b98d43488 Verify types in custom op schemas (#124520)
Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.

This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).

Test Plan:
- new tests

Differential Revision: [D56432690](https://our.internmc.facebook.com/intern/diff/D56432690)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124520
Approved by: https://github.com/albanD
2024-04-23 14:18:35 +00:00
107f944f22 Support fp8 quantization (#123161)
This commit enables float8_e5m2 and float8_e4m3fn dtypes in fx quantization and PT2E.

Motivation for using fp8 quantization instead of int8:
- it works better to run inference with the same datatype the model was trained with,
- fp8 can handle outliers better, which is one of the problems in LLMs activations.

The numerical recipe we want to use it for is fp8 inference:
- bgemms/gemms running in float8_e4m3fn,
- Per-Tensor-Quantization/Scaling,
- amax observer for measurement with input_backoff and weight_backoff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-04-23 13:35:27 +00:00
f8f6c460cd [Inductor max autotune] Make autotuning robust against very slow Kernels (#123932)
If a Kernel does not return in a reasonable amount of time during autotuning, it can delay inductor compilation a lot. This change introduces soft / hard kill timeouts and a mechanism to kill Kernels being profiled in subprocesses if they take too long.

Correspondingly, a few new config options are introduced within _inductor/config.py - all of them with inline docs.

Test Plan:
Existing tests within test_max_autotune.py and test_cutlass_backend.py ) cover the new codepaths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123932
Approved by: https://github.com/jansel
ghstack dependencies: #121497, #123930
2024-04-23 11:56:15 +00:00
25f321b84f Refactor autocast C++ APIs to be device-agnostic (#124359)
# Motivation
This PR aims to refactor autocast **C++** APIs to be device-agnostic and deprecate the device-specific autocast  **C++** APIs.
In C++ side,
- `is_enabled()` -> `is_enabled(device_type)`.
- `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`.
- `get_autocast_dtype()` -> `get_autocast_dtype(device_type)`
- `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)`

These following C++ APIs are deprecated and should be removed in PyTorch 2.5
- `is_cpu_enabled`
- `set_cpu_enabled`
- `get_autocast_cpu_dtype`
- `set_autocast_cpu_dtype`
- `is_xpu_enabled`
- `set_xpu_enabled`
- `get_autocast_xpu_dtype`
- `set_autocast_xpu_dtype`
- `is_ipu_enabled`
- `set_ipu_enabled`
- `get_autocast_ipu_dtype`
- `set_autocast_ipu_dtype`
- `is_hpu_enabled`
- `set_hpu_enabled`
- `get_autocast_hpu_dtype`
- `set_autocast_hpu_dtype`
- `is_xla_enabled`
- `set_xla_enabled`
- `get_autocast_xla_dtype`
- `set_autocast_xla_dtype`
- `is_privateuseone_enabled`
- `set_privateuseone_enabled`
- `get_autocast_privateuseone_dtype`
- `set_autocast_privateuseone_dtype`

In Python side,
provide 4 generic autocast APIs:
- `torch.is_autocast_enabled(device_type)`
- `torch.set_autocast_enabled(device_type, new_enabled)`
- `torch.get_autocast_dtype(device_type)`
- `torch.set_autocast_dtype(device_type, dtype)`

# Additional Context
We will submit another PR to refactor autocast **Python** APIs based on this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124359
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-04-23 10:38:50 +00:00
3c964ad1ca add fused_sgd_kernel support for CPU device (#123629)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9
https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_sgd time: 0.2301 seconds
_fused_sgd time: 0.0925 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_sgd time: 2.6195 seconds
_fused_sgd time: 1.7543 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```
Looks like we already have some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unified the UTs, I did not modified UT in this PR.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-23 08:28:19 +00:00
4efb980e07 [BE] Update older scipy used in CI to 1.8.1 (#124675)
As older scipy are affected by CVE-2023-29824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124675
Approved by: https://github.com/kit1980
2024-04-23 06:59:48 +00:00
7b6e354ecd [DDP][PT2D] Fix some tracing bugs of DDP (#124421)
1. We need to clear the cache of get_legacy_mod_inlinelist to ensure the newly added rule will be captured.
2. Don't add the hook if the parameter does not require gradient.

Differential Revision: [D56315534](https://our.internmc.facebook.com/intern/diff/D56315534/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124421
Approved by: https://github.com/yf225
2024-04-23 06:43:48 +00:00
9a5b4d2403 Do not forward parent's value range to CSE variable for variables created within codegen. (#123099)
Consider we are generating code for `ops.gt`, and within it we call
`ops.to_dtype`. Before, we would forward the bounds from `gt` to the
to the result of `to_dtype`, which is wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123099
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-04-23 06:26:39 +00:00
edcd968b51 Add out wrappers to some decompositions (#115437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
e0c5113dec Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 06:20:13 +00:00
c82fcb7b30 Add testing and fix weights_only load for quantized types and nn.Parameters with python attrs (#124330)
Adds the following to allowed globals for the `weights_only` unpickler
- [x] `torch._utils._rebuild_qtensor` and qtensor related types
- [x] `torch._utils._rebuild_parameter_with_state` (used deserializing a parameter that has user-defined attributes like `Param.foo`)

The remaining rebuild functions that have not been allowlisted are

- [x] `torch._utils._rebuild_wrapper_subclass` (allowlisted in above PR)
- [ ] `torch._utils._rebuild_device_tensor_from_numpy`
- [ ] `torch._utils._rebuild_xla_tensor` (legacy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124330
Approved by: https://github.com/albanD
2024-04-23 04:13:26 +00:00
de5d689cf9 [EZ] Update pillow to 10.3.0 (#124614)
As older versions as subject to [CVE-2024-28219](https://nvd.nist.gov/vuln/detail/CVE-2024-28219), although it's not super important from CI PoV

Modernize `torch/utils/tensorboard/summary.py` to use Pillow-9+ APIs (is this file even used for anything anymore?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124614
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-04-23 03:22:23 +00:00
7706cd7d12 Extend CPU inductor merge rule (#124671)
To help unblock: https://github.com/pytorch/pytorch/pull/123710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124671
Approved by: https://github.com/leslie-fang-intel, https://github.com/huydhn
2024-04-23 02:18:00 +00:00
660db767ef Don't clean up fresh inductor cache on error (#124620)
Useful for local debugging.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124620
Approved by: https://github.com/oulgen, https://github.com/desertfire, https://github.com/jansel
2024-04-23 02:13:05 +00:00
7e095be4b6 Fix test_max_autotune_remote_caching (#124655)
D55206000 broke this test. It is not clear why it did not run in the CI but here's the fix.

Differential Revision: [D56439213](https://our.internmc.facebook.com/intern/diff/D56439213/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124655
Approved by: https://github.com/aorenste
2024-04-23 01:41:04 +00:00
375ec25f55 Add missing aten::sort.any op for assistant lm models (#123982)
Differential Revision: D56084098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123982
Approved by: https://github.com/JacobSzwejbka
2024-04-23 01:35:07 +00:00
cyy
ea61c9cb29 [Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043
Approved by: https://github.com/ezyang
2024-04-23 00:43:50 +00:00
5f5778476a rename ort to maia (#123265)
Fixes #123264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
bffecb5aff [Inductor] Enable VecMask store (#123710)
**Summary**
Enable the vectorization of store with `bool` dtype.

**Test Plan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123710
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #123512
2024-04-23 00:29:47 +00:00
dd440ac734 Add Matmul recipe into x86_inductor_quantizer (#122776)
**Summary**
Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default.

**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```

Differential Revision: [D56288468](https://our.internmc.facebook.com/intern/diff/D56288468)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-04-23 00:25:41 +00:00
1fcdea8cd6 Do not import transformers when import torch._dynamo (#124634)
Fixes https://github.com/pytorch/pytorch/issues/123954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124634
Approved by: https://github.com/thiagocrepaldi, https://github.com/Chillee
ghstack dependencies: #124343
2024-04-23 00:25:20 +00:00
0c21161488 Add meta function for torch.histc (#124548)
Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-04-23 00:24:59 +00:00
6054789874 Make numel equal test size oblivious in reshape_symint (#124611)
Fixes https://github.com/pytorch/pytorch/issues/124581

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124611
Approved by: https://github.com/bdhirsh
ghstack dependencies: #124139
2024-04-22 23:59:40 +00:00
abf3f90781 [MPS] Fix large copy (#124635)
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks

Add regression test, but limit it to machines with 12Gb of RAM or more, and MacOS 14+, as on MacOS 13 attempt to alloc 4Gb tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

Fixes https://github.com/pytorch/pytorch/issues/124335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
2024-04-22 23:43:11 +00:00
72a34eeb99 Dynamo x autograd.Function supports non-{Tensor, symnode, constant} inputs (#124360)
Fixes #118395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124360
Approved by: https://github.com/zou3519
2024-04-22 23:32:54 +00:00
302d7e9a6e [Binary Build] Increase timeout for Linux nightly binary builds (#124668)
Related issue: https://github.com/pytorch/pytorch/issues/124667. Please note, this is mitigation PR. Will follow up with investigation and proper fix for this.

Similar to: 94d6463255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124668
Approved by: https://github.com/huydhn
2024-04-22 22:38:39 +00:00
87a35d5a29 Use new function to log one cluster per line (#124628)
Summary:
For motivation behind the overall stack of diffs see D56218385 summary.

This particular diff makes cpp_dumper take a custom printer function to log callstacks one-group-at-a-time and as such no longer running into 30K characters limit of `LOG(INFO)`.

Test Plan:
```
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/...
File changed: fbcode//common/base/ThreadStackTrace.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
4 additional file change events
Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98
Network: Up: 0B  Down: 0B
Jobs completed: 2. Time elapsed: 1.5s.
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
[romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$
```

Tested to print the stack trace:
P1220109730

Differential Revision: D56218360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628
Approved by: https://github.com/wconstab
2024-04-22 21:57:39 +00:00
501edc7e59 [inductor, test] remove cast for test_tmp_not_defined_issue2_cpu (#114910)
Does this verify that https://github.com/pytorch/pytorch/issues/94017 is fixed?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114910
Approved by: https://github.com/angelayi
2024-04-22 21:51:53 +00:00
ba3c00c266 [test_profiler.py] Disable tqdm monitor thread and torch.compile with compile_threads=1 (#124409)
Summary: if tqdm is not shutdown properly, it will leave the monitor thread alive. This causes an issue in the multithreading test because we check all events in that test with their tids. The events that correspond to these lingering threads all have TID of (uint64_t)(-1) which is invalid. The work around is turning off monitoring thread when tqdm is loaded. Since these are unit tests, it is safe to turn off monitor thread.

Test Plan: buck test  mode/dev-sand caffe2/test:profiler

Differential Revision: D56310301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124409
Approved by: https://github.com/aaronenyeshi
2024-04-22 21:51:14 +00:00
c01499ecc6 [sym_shapes][perf] Cache ShapeEnv constrain_symbol_range calls (#124610)
Differential Revision: [D56422688](https://our.internmc.facebook.com/intern/diff/D56422688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124610
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-22 21:49:08 +00:00
05addd5658 [tp] add kwargs support to prepare_module_input (#124114)
as titled, this PR adds kwargs support to PrepareModuleInput style,
where there might be modules who have only kwargs inputs but no
positional args, so we should support this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124114
Approved by: https://github.com/XilunWu
2024-04-22 21:46:31 +00:00
5785b02ba6 Skip workspace permission change for ROCm CI (#123816)
PR https://github.com/pytorch/pytorch/pull/122922 added chown steps to test.sh and used the trap mechanism to ensure that, even if the test scripts fails and exits with a non-zero code, it will call the cleanup_workspace function on EXIT.

However, this doesn't work as intended when the CI job gets cancelled for eg. if a PR pushes new commits and the older commit CI job gets cancelled. The trap function doesn't get called as the test script immediately aborts.

Any subsequent jobs scheduled on the same runner then fail in the 'Checkout PyTorch' step when they try to delete the workspace.

This has been resulting in a slew of CI failures on the HUD.

Example of this situation playing out on one of the ROCm runners:
Cancelled job: https://github.com/pytorch/pytorch/actions/runs/8563212279/job/23469711035

![image](https://github.com/pytorch/pytorch/assets/37884920/7192e4fe-8cff-4256-abc8-9f874a3918ff)

Subsequent failed job: https://github.com/pytorch/pytorch/actions/runs/8564517036/job/23472675041

![image](https://github.com/pytorch/pytorch/assets/37884920/24b0af66-cfe9-431f-851a-24a1ccc18e84)

This PR skips the logic introduced by PR 122922 for ROCm CI.

Alternative to https://github.com/pytorch/pytorch/pull/123468 and https://github.com/pytorch/pytorch/pull/123588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123816
Approved by: https://github.com/pruthvistony, https://github.com/zxiiro, https://github.com/kit1980, https://github.com/malfet
2024-04-22 21:27:32 +00:00
bb37910e30 [AOTI] Fixes ScatterFallback codegen (#124580)
Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures.

Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580
Approved by: https://github.com/chenyang78
2024-04-22 20:47:26 +00:00
fd59554be6 Scripts to compile reruns + td exclusions and upload to s3 (#124312)
Edits upload_test_stats to also upload a condensed version that contains reruns, and one that contains the list of td_exclusions.

Grouped by build name + test config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124312
Approved by: https://github.com/malfet
2024-04-22 20:19:35 +00:00
0bbbc754dd Add AOTInductor generated cpp code to TORCH_TRACE (#124617)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124617
Approved by: https://github.com/albanD
2024-04-22 19:25:20 +00:00
0093735ccd [inductor] Use compile time config values in runtime (#124561)
This removes usage of torch._inductor.config from `torch._inductor.runtime`.  Fixing two issues:
1) If configs change we should really use the compile time ones
2) In compile workers, we want to use the parent process config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569
2024-04-22 18:46:40 +00:00
cb9fe91f5c [inductor] Remove config check for 3D tiling (#124569)
This makes the check per-kernel (if 3D tiling is used), rather than
global config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560
2024-04-22 18:46:40 +00:00
4620a45542 [inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124560
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559
2024-04-22 18:46:35 +00:00
0cc0e60e30 [inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557
2024-04-22 18:46:29 +00:00
7fd8870e6b [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 18:46:24 +00:00
bb8815bc31 [inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552
2024-04-22 18:46:20 +00:00
480585fd2b [inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552
Approved by: https://github.com/yanboliang
2024-04-22 18:41:12 +00:00
16eea7c6a5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552)"
This reverts commit a7035cc11aa3aefe1a45a9ba6d0cb4d2a6f2e7c1.

Reverted https://github.com/pytorch/pytorch/pull/124552 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
56714cb497 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553)"
This reverts commit f4d47f5bbb07bed98b1eb8313607be6e94686269.

Reverted https://github.com/pytorch/pytorch/pull/124553 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
0b90af0bf5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)"
This reverts commit fcf28b0ad59b1912d5783688b0f25f18b46efeb3.

Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
b3d6c2fe9b Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559)"
This reverts commit 9ea2a0951005c4bcb2491556a8548319c6cccfdb.

Reverted https://github.com/pytorch/pytorch/pull/124559 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
0f44ef93ab Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560)"
This reverts commit 3ac30bc32ad300d70391ec552e5738d6ed66f9a5.

Reverted https://github.com/pytorch/pytorch/pull/124560 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
8973c5b846 Revert "[inductor] Remove config check for 3D tiling (#124569)"
This reverts commit 317c0af149855b5924a59170a18abecca97e2ce0.

Reverted https://github.com/pytorch/pytorch/pull/124569 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
30dec1da84 Revert "[inductor] Use compile time config values in runtime (#124561)"
This reverts commit 3af12447f85dfede191a113c052e58fa7b21a8b3.

Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))
2024-04-22 18:24:38 +00:00
d77e7b7c54 Make some kernel static asserts clearer (#124519)
Users get int/int64_t and double/float confused a lot.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124519
Approved by: https://github.com/Skylion007
2024-04-22 18:17:40 +00:00
c2f8bfae9c Make torch._inductor.dependencies.Dep a proper class (#124407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124407
Approved by: https://github.com/peterbell10
2024-04-22 17:09:34 +00:00
77c35334c1 Fix build on s390x (#123250)
Rename s390x-specific zvector functions with same name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123250
Approved by: https://github.com/malfet
2024-04-22 16:57:08 +00:00
be2e56b5ab s390x: update using vectorization builtins (#124396)
With gcc >= 12 on s390x store builtins
are accidentally optimized out due to
bad type aliasing.

Ensure that proper corresponding types are used,
and if types do mismatch,
first store data into array of correct type
and then memcpy it to destination pointer.

See also:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124396
Approved by: https://github.com/malfet
2024-04-22 16:55:18 +00:00
0ee514e628 [CI] Upgrade xpu driver to LTS_803.29 (#123920)
Upgrade xpu driver from 647.21  to LTS 803.29

Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123920
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/huydhn
2024-04-22 16:45:01 +00:00
9c2ac4476c Allow ONNX models without parameters (#121904)
Currently, if initializers are available, they are included in the ONNX model. If they are not available, the model is serialized without them.

However, there are times in which the initializers are avaialable, but the user prefers not to include them in the model, say for visualizing it on Netron or because the initialziers will be specified along with the inputs in the onnx runtime of choice.

This PR allow users to pass `include_initializers` to `ONNXProgram.save()` API.

Fixes #100996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121904
Approved by: https://github.com/titaiwangms
2024-04-22 15:53:38 +00:00
6ede882c0b preferred blas library; cublaslt gemm implementation (#122106)
Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources.

The default blas implementation remains cublas or hipblas.  cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106
Approved by: https://github.com/lezcano
2024-04-22 15:38:22 +00:00
9a322ba1b0 [user triton] Return unbacked SymInts used in the grid (#124594)
Summary: When unbacked SymInts are used only in a grid of a user-written Triton kernel call, there is no dependency between the Triton kernel's buffer and those unbacked SymInts. As a result, definition of the unbacked SymInts are not codegen-end and the code using them in the grid definition breaks.

Here we add the unbacked SymInts used in the grid to the `get_unbacked_symbol_uses` returned by the `UserDefinedTritonKernel` alongside those used in the `kwargs` (returned by `ExternKernel`).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_unbacked_symint
...
----------------------------------------------------------------------
Ran 24 tests in 155.764s

OK (skipped=16)
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D56406991](https://our.internmc.facebook.com/intern/diff/D56406991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124594
Approved by: https://github.com/oulgen
2024-04-22 15:33:30 +00:00
277ab8a4c0 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit a56e057814565b2ae33b2106b4d0136179aa18f8.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/jeanschmidt due to Broken internal signals, @albanD please help get this sorted :) ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2069716129))
2024-04-22 14:44:44 +00:00
d5037c389c [Inductor cutlass backend] Fix tests: skipIfROCm always skips when using as class annotation (#123930)
I previously added @skipIfRocm as a class annotation within test/inductor/test_cutlass_backend.py - turns out this annotation always skips if applied at class level, so I need to skip Cutlass tests on ROCm differently..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123930
Approved by: https://github.com/jansel
ghstack dependencies: #121497
2024-04-22 13:59:59 +00:00
ad7b5d32b6 Intel GPU oneDNN Upstreaming: Convolution operators support (#117529)
# Motivation

This PR is a part of RFC #114848,. This PR would depend on oneDNN compilation in #117098 and basic integration support in #117112 and Conv integration code in #117512.  Some runtime support is needed in #116019.

This PR implements the convolution and deconvolution operators for XPU that should be defined in `aten` libraries. Also, the backward support is also supported.

With this PR, the conv-related operators should be functionality ready.

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117529
Approved by: https://github.com/EikanWang, https://github.com/malfet
ghstack dependencies: #117512
2024-04-22 13:22:36 +00:00
9d4bef6248 Intel GPU oneDNN upstreaming: Conv primitive integration (#117512)
# Motivation

This PR is a part of RFC #114848,. This PR would depend on oneDNN compilation in #117098 and basic integration support in #117112.  Some runtime support is needed in #116019.

This PR provides the oneDNN integration code for Convolution and Deconvolution related operators. All aten convolution operators(conv, deconv, and conv-pointwise fusion) will goes into this layer before executing oneDNN primitive.  The integration code is responsible for providing correct memory description for primitive and accompanied with primitive attribute description.

Wit this PR land, we will add Conv related operators accompanied with their registration.

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117512
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-04-22 12:20:54 +00:00
42bd1abc62 [Inductor Cutlass backend] Tolerate dynamic shapes (#121497)
Previously, when the Cutlass backend was enabled, using dynamic shapes could lead to exceptions during JIT.

With this change, there are guards in place to just disable the Cutlass backend if dynamic dimensions are involved.

In addition, if no choices for a GEMM are available using the selected backends, then an ATen Kernel is used as fallback, even if the ATen backend is not enabled.

Test:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121497
Approved by: https://github.com/jansel
2024-04-22 12:05:50 +00:00
34bce27f0d Revert "fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037)"
This reverts commit 6e24cc012b130869d0029280dcbb34efdd0032cc.

Reverted https://github.com/pytorch/pytorch/pull/124037 on behalf of https://github.com/jeanschmidt due to seems to have introduced a regression in pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu) ([comment](https://github.com/pytorch/pytorch/pull/124037#issuecomment-2068659093))
2024-04-22 07:20:10 +00:00
3af12447f8 [inductor] Use compile time config values in runtime (#124561)
This removes usage of torch._inductor.config from `torch._inductor.runtime`.  Fixing two issues:
1) If configs change we should really use the compile time ones
2) In compile workers, we want to use the parent process config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569
2024-04-22 04:51:30 +00:00
317c0af149 [inductor] Remove config check for 3D tiling (#124569)
This makes the check per-kernel (if 3D tiling is used), rather than
global config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124569
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559, #124560
2024-04-22 04:51:30 +00:00
3ac30bc32a [inductor] Refactor runtime files into torch._inductor.runtime (part 5) (#124560)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124560
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557, #124559
2024-04-22 04:51:24 +00:00
9ea2a09510 [inductor] Refactor runtime files into torch._inductor.runtime (part 4) (#124559)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124559
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553, #124557
2024-04-22 04:51:20 +00:00
fcf28b0ad5 [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 04:51:15 +00:00
f4d47f5bbb [inductor] Refactor runtime files into torch._inductor.runtime (part 2) (#124553)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124553
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552
2024-04-22 04:51:09 +00:00
a7035cc11a [inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552
Approved by: https://github.com/yanboliang
2024-04-22 04:51:05 +00:00
6e24cc012b fix Invalid call to aoti_torch_tensor_copy_ #123039 (#124037)
fixes #123039

In abi mode, ExternKernelSchedulerNode generates code using `aoti_torch_tensor_copy_` which requires `AtenTensorHandle`, but the allocation generates ArrayRefTensor to allocate mem in stack. To fix this issue, this PR prevents ExternKernelSchedulerNode from using stack-mem-allocation in abi, and creates AtenTensorHandle instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124037
Approved by: https://github.com/desertfire
2024-04-22 01:34:22 +00:00
b1984237a0 [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi
2024-04-22 01:26:55 +00:00
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime exception, value errors, type errors or some other errors. There are hundreds of instance of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get commiters to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed overtime and our exception typing can be improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
fd90991790 [rfc] opentelemetry in pytorch (#122999)
1. Add current latest version (opentelemetry-cpp version v1.14.2) to PyTorch library.
Steps:
```
$cd pytorch
$git submodule add https://github.com/open-telemetry/opentelemetry-cpp.git third_party/opentelemetry-cpp
$cd third_party/opentelemetry-cpp
$git checkout v1.14.2
$git add third_party/opentelemetry-cpp .gitmodules
$git commit
```
Expected change in checkout size:
```
(/home/cpio/local/a/pytorch-env) [cpio@devvm17556.vll0 ~/local/pytorch (gh/c-p-i-o/otel)]$ git count-objects -vH
count: 654
size: 3.59 MiB
in-pack: 1229701
packs: 17
size-pack: 1.17 GiB
prune-packable: 76
garbage: 0
size-garbage: 0 bytes
```

2.

TODO
- [x] Figure out how dynamic linking works. App builders will somehow need to `target_include` opentelemetry-cpp at runtime.
- [ ] Examples on how to use opentelemetry + pytorch
- [ ] Tests + documentation (e.g. using null opentelemetry implementation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122999
Approved by: https://github.com/ezyang
2024-04-21 15:20:21 +00:00
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods instead of manually reimplementing (update, set_difference etc).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
f34905f61d Assert that TracingContext is available when set_example_value is called (#124284)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
2024-04-21 11:23:13 +00:00
0e6367dd44 Factor var_to_range assignments to _update_var_to_range helper (#124283)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
2024-04-21 11:23:13 +00:00
cbf420b67a [inductor] for UserDefinedTritonKernels don't mark all inputs as mutating (#124425)
Take this example:
```
def _mul2(x):
    y = torch.empty_like(x)
    mul2_kernel[(10,)](
        in_ptr0=x, out_ptr=y,
        n_elements=x.numel(), BLOCK_SIZE=1,
    )
    return y

def f(x):
    for _ in range(4):
        x = _mul2(x)
    return x + 1
```

Currently, the codegen will show up like this. Notice, how we allocate 5 buffers of the same size.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf4 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf8 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf8, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf12 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf8, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf12, (10, ), (1, ), 0) ...)

# Source Nodes: [add], Original ATen: [aten.add]
buf16 = empty_strided_cuda((10, ), (1, ), torch.float32)
triton_poi_fused_add_0.run(buf12, buf16, 10, grid=grid(10), stream=stream0)...)
return (buf16, )
```

With this PR, we want to see this. Notice, how we only allocate 2 buffers this time. The other 3 buffers are re-used.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0), ...)
del arg0_1

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf2 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf2, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf4 = buf0; del buf0  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf2, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf6 = buf2; del buf2  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf6, (10, ), (1, ), 0) ...)
del buf4

# Source Nodes: [add], Original ATen: [aten.add]
buf8 = buf6; del buf6  # reuse
triton_poi_fused_add_0.run(buf8, 10, grid=grid(10), stream=stream0)
return (buf8, )
```

Differential Revision: [D56379307](https://our.internmc.facebook.com/intern/diff/D56379307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124425
Approved by: https://github.com/oulgen
2024-04-21 06:00:14 +00:00
0d90d4d613 [Dynamo] Fix NamedTuple hasattr bug (#124531)
Fixes #124402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531
Approved by: https://github.com/jansel
2024-04-21 04:36:22 +00:00
a6a3f2e06b [MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
Fixes GELU, LeakyRELU and MISH activation functions on non-contiguous tensors (for instance, when a transpose operation was applied on the tensors prior to the MPS operator), forward and backward passes.

I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D.

I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() on some cases, so I reverted to the previous setup by setting a gradient parameter on .backwards().
This PR also fixes an issue with LeakyRELU on empty tensors.

Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
2024-04-21 00:12:32 +00:00
98f3e0214b [NCCL][TEST] Synchronize proper devices (#124517)
There are multiple instances of `torch.cuda.synchronize()` calls without arguments. These calls cause device 0 being synchronized from multiple ranks while the rest of the devices are not. I am pretty sure that was not intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124517
Approved by: https://github.com/wconstab, https://github.com/eqy
2024-04-20 23:42:32 +00:00
d6f88105ce Fix the problem about load_state_dict with unexpected key whose prefix matches a valid key (#124385)
Fixes https://github.com/pytorch/pytorch/issues/123510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124385
Approved by: https://github.com/mikaylagawarecki
2024-04-20 23:19:25 +00:00
afa78ad08c Call writeline from writelines (#124515)
This makes it more convenient to add a breakpoint here.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515
Approved by: https://github.com/albanD
2024-04-20 15:45:30 +00:00
a32eac345f [dynamo] Return gm.forward for eager backend (#124109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124109
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #124445
2024-04-20 14:11:05 +00:00
febc4d8759 [dynamo][easy] forbid_in_graph check to use getattr_static (#124445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-20 14:11:05 +00:00
97ccfad915 Fix test_decomp test for ops with py_impl(CompositeImplicitAutograd) (#116832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116832
Approved by: https://github.com/lezcano
2024-04-20 11:10:38 +00:00
a3e3693afc [Dynamo] Fix missing bracket in ListVariable (#124532)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532
Approved by: https://github.com/williamwen42
2024-04-20 08:26:30 +00:00
f20e3ae0c3 Use recursive blob for package data (#119257)
setup.py now supports recursive glob for package data

I only added `.cpp`, `.h`, and `.yaml` files. Not sure if you want to include BAZEL or other files in package_data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119257
Approved by: https://github.com/zou3519
2024-04-20 06:33:39 +00:00
0d0b5b2655 Enable dynamo rosenbrock sparse tests (#124542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542
Approved by: https://github.com/yf225
ghstack dependencies: #124540, #124541
2024-04-20 05:54:41 +00:00
184f16016e Enable dynamo-traced deepcopy test for RMSprop (#124541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541
Approved by: https://github.com/yf225
ghstack dependencies: #124540
2024-04-20 05:54:41 +00:00
6a730698e2 Enable dynamo-traced Adamax tests (#124540)
Enabling tests related to https://github.com/pytorch/pytorch/issues/121178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540
Approved by: https://github.com/yf225
2024-04-20 05:54:41 +00:00
f1cbaf1764 Adds LSE output for templated-attention-hop if inputs require grad (#124308)
Adds LSE output for templated-attention-hop if inputs require grad

Prep PR for adding autograd support to templated-attention-hop. The kernel needs to output the LSE during the forward which will be used during backwards.

### Output code
https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a

## Before
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.159 |              |             |             |             |            |               |                |
| Max     |     1.342 |           16 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     1.016 |            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |

## After
 Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.155 |              |             |             |             |            |             |                |
| Max     |     1.339 |           16 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     1.009 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.bfloat16 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308
Approved by: https://github.com/Chillee
2024-04-20 05:45:56 +00:00
0d64b82f0b Make CompiledFxGraph portable between machines (#124438)
As we prepare FxGraphCache to move to remote, we need to make sure there's no data that is on the disk.

Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438
Approved by: https://github.com/jansel
2024-04-20 05:26:14 +00:00
c5a4ba2257 [inductor] consider pointwise nodes when deciding reduction hint (#124131)
In certain **rare** scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if
- the reduction kernel contains a reduction node followed by a pointwise node
- And the pointwise node use a transposed layout.
- the reduction node is an inner reduction
- and rnumel <= 1024 ,

then inductor will generate a persistent reduction kernel and it causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))` .

I've tried a few version of fix.
- The first version is, if I found any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This cause 8s compilation time increase for huggingface with no perf wins... The reason is ReductionHint.DEFAULT does more autotunings.
- Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels meet this condition should mostly have really bad perf anyways.

The situation mentioned above is rare. But it's reported by internal users. I'll also run one more perf test.

Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test.

For this specific test from user reports, we improve the mentioned reduction kernels perf by **4.14x** (451us -> 109us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131
Approved by: https://github.com/jansel
2024-04-20 05:07:56 +00:00
57f64197f3 Reduce warning msg in torch.profiler (#124469)
Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once?

Test Plan:
On AMD GPU side, I got a lot of those warnings:
```
W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())”
```
So just suppress the excessive logs

Reviewed By: aaronenyeshi, yoyoyocmu

Differential Revision: D55602788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469
Approved by: https://github.com/aaronenyeshi
2024-04-20 04:45:12 +00:00
b79b0d3d6a Enable UFMT on test/test_legacy_vmap.py (#124381)
Part of https://github.com/pytorch/pytorch/issues/123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124381
Approved by: https://github.com/ezyang
2024-04-20 03:37:57 +00:00
3d8b903d95 [PyTorch] Remove ArrayRefTensor::numel_ (#124516)
ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion.

Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-20 02:44:20 +00:00
f9fce110af [FSDP2][ez] Removed error check for swap tensors flag (#124513)
Since `DTensor` uses `swap_tensors` path automatically now, we can remove this check for the global flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124513
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319, #120256
2024-04-20 00:46:36 +00:00
1c2cb36811 [FSDP2] Added CPU offloading (#120256)
#### Overview
This PR adds CPU offloading via the `offload_policy: OffloadPolicy` argument.
- We incur one H2D copy for each parameter before all-gather.
- We incur one D2H copy for each gradient after reduce-scatter.
- We run optimizer on CPU.

#### Example (Mixed Precision and CPU Offloading)
This example uses a small 125M numel model, which is not too representative. We can try to run with a larger model like Llama-7B. However, since the current optimizer step is already too slow, we may want to patch a faster CPU optimizer.

Forward
![Screenshot 2024-02-21 at 10 36 29 AM](https://github.com/pytorch/pytorch/assets/31054793/00ed95db-3a55-49bb-ac98-9b9162feaacd)
![Screenshot 2024-02-21 at 10 39 12 AM](https://github.com/pytorch/pytorch/assets/31054793/10e29854-1907-4001-b3dc-aab6c3bf153c)

Backward
![Screenshot 2024-02-21 at 10 37 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7039cb2e-eb78-4f53-b83f-67bae61ebddd)
![Screenshot 2024-02-21 at 10 38 44 AM](https://github.com/pytorch/pytorch/assets/31054793/e34615d6-6b6b-4995-aef1-9c7563034799)

Overall CPU (CPU optimizer step dominates)
![Screenshot 2024-02-21 at 10 39 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7a2a929a-3a40-4b35-891b-016cf57e8079)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120256
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319
2024-04-20 00:42:58 +00:00
cf5ca58e7f [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-19 23:13:59 +00:00
acbf888a13 rename sl to strobelight (#124455)
Summary:
TORCH_COMPILE_SL_PROFILE ->TORCH_COMPILE_STROBELIGHT
SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH
SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME
profile_with_sl() -> strobelight()
compiletime_sl_profile_meta() -> compiletime_strobelight_meta()

Test Plan:
1. run and verify
```
TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
2. run and verify
```
buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:function_profiler_example --local-only
```
3. run and verify truncated stack for
```
TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
4. add infinite loop in _verify and verify samples for
```
COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```

Reviewed By: oulgen

Differential Revision: D56327139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455
Approved by: https://github.com/oulgen
2024-04-19 22:50:13 +00:00
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899d4d6a55d66d4f7188e36c20a078231.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff908d2a9c20d5354426d34c539fcb7a1.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
52da03edeb Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit b6f0159db08c1ad55fe57a5e92d8933e21ea543e.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
f8f7cfbeee Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723)
support following case:
```python
import torch
...
class CustomFooTensor(torch.Tensor):
  @classmethod
  def __torch_function__(cls, func, types, args=(), kwargs=None):
    ...
a = CustomFooTensor([3])
print(a.is_foo)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723
Approved by: https://github.com/albanD
2024-04-19 22:34:34 +00:00
19850d770d update triton pin (#124429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124429
Approved by: https://github.com/shunting314, https://github.com/malfet
2024-04-19 22:34:28 +00:00
d8a98ddd60 Prep PR for cutlass 3.5 update (#124412)
# Summary
These changes are needed for the upgrade to cutlass 3.5
#123458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124412
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia, https://github.com/malfet
2024-04-19 22:10:37 +00:00
b3504af56e Enable UFMT on test/scripts and some files (#124137)
Part of: #123062

Ran lintrunner on:

- `test/scripts`
- `test/simulate_nccl_errors.py`
- `test/test_ao_sparsity.py`
- `test/test_autocast.py`
- `test/test_binary_ufuncs.py`
- `test/test_bundled_images.py`
- `test/test_bundled_inputs.py`
- `test/test_comparison_utils.py`
- `test/test_compile_benchmark_util.py`
- `test/test_complex.py`
- `test/test_cpp_api_parity.py`
- `test/test_cpp_extensions_aot.py`
- `test/test_cpp_extensions_jit.py`
- `test/test_cpp_extensions_open_device_registration.py`

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124137
Approved by: https://github.com/soulitzer
2024-04-19 22:01:27 +00:00
f0560f7b3b [opcheck] Stop doing test_aot_dispatch_static by default (#124495)
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
  if their operator was "registered correctly". If a user's custom op
  only supports dynamic shapes, then it's a bit awkward for
  one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
  our opcheck tests.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
2024-04-19 21:57:22 +00:00
37d18966ea [custom_op] set some tags when constructing the op (#124414)
- the op is automatically "pt2-compliant"
- In general we want to turn on needs_fixed_stride_order for all customm
  ops, but this needs some more work, so we're just going to turn it on
  for the new custom op API.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124414
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403
2024-04-19 21:57:22 +00:00
1900f79b72 [FSDP2] Added set_reshard_after_backward (#124319)
This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward.

```
for microbatch_idx, microbatch in dataloader:
    is_last_microbatch = microbatch_idx == num_microbatches - 1
    model.set_requires_gradient_sync(is_last_microbatch)
    model.set_reshard_after_backward(is_last_microbatch)
    model.set_is_last_backward(is_last_microbatch)
    microbatch_fwd_bwd(model, microbatch, microbatch_idx)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319
Approved by: https://github.com/weifengpy
2024-04-19 21:49:35 +00:00
10b9d4d19c [export] handle Dim.lower = 0, 1 for ep.run_decompositions() (#123602)
Summary:
With pre-dispatch export and ep.run_decompositions(), range constraints are updated through looking at ShapeEnv.var_to_range. However the lower bounds on these may be incorrect - analysis on un-specialized symbols are done with lower bounds of 2, which mismatch with user-specified bounds (may be 0, 1).

This updates `_get_updated_range_constraints()` to use the old range constraints if possible.

Test Plan: Existing pre-dispatch/dynamic shapes test case.

Differential Revision: D55899872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602
Approved by: https://github.com/tugsbayasgalan
2024-04-19 21:29:36 +00:00
c74dfca5e7 Int4MM: Unswizzle for different dtypes (#124448)
If dtype is not the one this platform is optimized for, it might need different unswizzling pattenrs Implement ones for non-vectorized flavor of the kernel, so that int4mm can be used with float32 and float16 dtypes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124448
Approved by: https://github.com/jgong5, https://github.com/mikekgfb
2024-04-19 21:17:15 +00:00
000d55870a Enable in oss (#124031)
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
e6a788ac26 Fix compilation on aarch64 with gcc (#124511)
Which is more stringent than clang when equivalently sized NEON registers are cast to each other. In particular, at one point `uint16x4_t` were cast to `int16x4_t`, which gcc does not allow. Added `vreinterpret_s16_u16` (which is a no-op) to solve this and tested in https://godbolt.org/z/sYb4ThM6M

Test plan: Build aarch64 wheels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124511
Approved by: https://github.com/mikekgfb
2024-04-19 19:53:19 +00:00
179108f14d Use separate flags for MultiTemplates from BenchmarkFusion (#122825)
Two changes:
- Make the flag for multi template buffer independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade offs are different than for just templates, which we'd like to enable by default.
- Dont do MultiTemplateBuffers/benchmark-fusion for templates which have custom input gen fn's (currently which only exist internally). Threading the custom input gn fns to benchmark fusion is NYI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229
2024-04-19 19:50:42 +00:00
73f56e1e81 [sym_shapes][perf] Do not calculate hint in advice_is_size (#124472)
Differential Revision: [D56352412](https://our.internmc.facebook.com/intern/diff/D56352412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124472
Approved by: https://github.com/ezyang
2024-04-19 19:10:24 +00:00
661fd23640 [AMD] TunableOp take priority over DISABLE_ADDMM_HIP_LT (#124161)
Summary: It seems super confusing that if we set DISABLE_ADDMM_HIP_LT + PYTORCH_TUNABLEOP_ENABLED, the former takes priority. This is because the former goes through the gemm_and_bias and tunable op is integrated with gemm path. Before we can integrate tunable op with gemm_and_bias, we'll probably just let tunable op takes priority

Test Plan: Run a simple linear program and verified.

Differential Revision: D56183954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124161
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
2024-04-19 19:08:06 +00:00
f87c788a34 Revert "Capture triton kernel in execution trace (#124140)"
This reverts commit 89407eca3b0be3c0272b5c583f8e77b9108a71f8.

Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))
2024-04-19 19:05:44 +00:00
761de37ab7 [sym_shape][perf] eval_static: guards, unbacked compute once (#124217)
Differential Revision: [D56212345](https://our.internmc.facebook.com/intern/diff/D56212345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124217
Approved by: https://github.com/ezyang
2024-04-19 19:03:04 +00:00
8869b543e8 [AMD] Remove deprecated macro from COnvUtils (#124158)
Summary:
This is not great, but our ATen-cpu is not completely GPU agnostic. Previously we have worked on D54453492 (https://github.com/pytorch/pytorch/pull/121082) and D54528255, but there are a few things we haven't resolved, and it's exploding here. So we'll continue to fix them until all are gone.

This ROCm block is for 4.3 which is very old. I don't think it should be supported any more. So let's just kill this macro

Test Plan: CI

Differential Revision: D56172660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124158
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
2024-04-19 19:00:31 +00:00
b0d83726bd [5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241)
Summary: as title

Test Plan:
CI & unit test

patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test

Differential Revision: D56223917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-19 18:57:38 +00:00
25c65d6642 Change register_autograd to reflect ordering of setup_context and backward (#124403)
old: `register_autograd(setup_context, backward, /)`
new: `register_autograd(backward, /, *, setup_context=None)`

Motivations:
- We introduce these APIs as "give us a backward and use setup_context
  to save things for backward".
- setup_context isn't always necessary.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124403
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199
2024-04-19 17:56:30 +00:00
a8e17b2d4d Move schema inference to torch._library (#124199)
After this PR, we can delete torch._custom_op/torch._custom_ops (except
there are external libraries depending it).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124199
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134
2024-04-19 17:56:30 +00:00
a78450a00b Excise uses of the old custom ops APIs (#124134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124134
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299
2024-04-19 17:56:26 +00:00
9489019085 Small fixes for deferred epilogue (#123229)
Two small fixes:

- preserve rng around compile_fx_inner
- Now that will precompile in the background while lowering multiple templates in parallel, we no longer can allocate inputs at the beginning of the function because we will have multiple sets of inputs allocated at the same time. Instead, allocate them when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123229
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642
2024-04-19 17:41:29 +00:00
39fc280dce Dont precompile already seen keys, limit epilogue choices (#122642)
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
2024-04-19 17:34:22 +00:00
7ae835eee4 Enable SourcelessBuilder to build GraphModule generated by make_fx (#123673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123673
Approved by: https://github.com/ezyang, https://github.com/anijain2305
ghstack dependencies: #123680
2024-04-19 17:23:51 +00:00
68a027f144 Fixes for 123400 (#123406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123406
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404, #123405, #124309
2024-04-19 17:20:57 +00:00
5050e627dc Defer marking_static_address (#124309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124309
Approved by: https://github.com/anijain2305
ghstack dependencies: #123324, #123404, #123405
2024-04-19 17:20:57 +00:00
1531a29fb9 Enable tests related to 116061 (#123405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123405
Approved by: https://github.com/janeyx99
ghstack dependencies: #123324, #123404
2024-04-19 17:20:54 +00:00
406d99e46c Fix for 117147 (#123404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123404
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #123324
2024-04-19 17:20:50 +00:00
203d111c54 Enable dynamo test_forloop_goes_right_direction_multi_gpu (#123324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123324
Approved by: https://github.com/janeyx99
2024-04-19 17:20:41 +00:00
293f756cdc Support aot_export torchbind op (#123370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123370
Approved by: https://github.com/zou3519
ghstack dependencies: #123367
2024-04-19 17:17:27 +00:00
e62169a8fa Support torchbind op dispatch in python (#123367)
We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table.

The idea is:
1. when inputs contains FakeScriptObject,  we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original c++ dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
2024-04-19 17:17:27 +00:00
136f8378e1 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-19 17:03:33 +00:00
bad8d25881 Add torch.library.register_kernel (#124299)
This mirrors the .register_kernel method on the object produced by the
custom_op decorator.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200
2024-04-19 13:54:21 +00:00
3918dfedc5 [custom_op] Rename register_impl to register_kernel (#124200)
Motivation:
- The API is used for registering an implementation for a specific
  device type.
- "impl" is ambiguous and can be confused with Library.impl.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124200
Approved by: https://github.com/albanD
ghstack dependencies: #124180
2024-04-19 13:54:21 +00:00
22a2f676c3 [custom_op] add ability to provide manual schema (#124180)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124180
Approved by: https://github.com/albanD
2024-04-19 13:54:13 +00:00
cyy
a56e057814 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD
2024-04-19 13:39:41 +00:00
8b1ad51881 Better Error Message in ChainedScheduler and SequentialLR (#121633)
Fixes #121577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121633
Approved by: https://github.com/janeyx99
2024-04-19 13:37:41 +00:00
c9db59e9e4 [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-19 13:31:58 +00:00
96724a769b [ptd] drop ncclGroupStart/end for ncclCommInit (#124363) (#124416)
Summary:

```
ncclGroupStart()
ncclCommInit(..)
ncclGroupEnd()
```

above pattern is only needed when we have *single-thread* to manage multiple GPUs

in our case, we always have 1 process managing 1 GPU, we don't need group operation.

Test Plan: CI

Differential Revision: D56274975

Co-authored-by: Cen Zhao <cenzhao@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416
Approved by: https://github.com/shuqiangzhang
2024-04-19 13:12:42 +00:00
88fa843e58 Add vectorized norm fill for ppc64le (#113351)
This patch adds the vectorized norm fill for ppc64le.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113351
Approved by: https://github.com/jgong5
2024-04-19 12:34:00 +00:00
8e280862ff Add custom joint graph passes (#124443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124443
Approved by: https://github.com/aorenste, https://github.com/malfet
2024-04-19 11:54:46 +00:00
b412b75b42 [optim] add fused_adam/adamw_kernel support for CPU device (#123074)
On par with `CUDA` implementation.

For `autocast` logic, same with `CUDA` + `Fused Adam`:
 - check inf in `gradscalar.step`
 - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param.

**TestPlan**:
```
# extend CUDA only test for CPU fused adagrad
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_torch.py -k test_grad_scaling_autocast_fused

# extend fused test
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
python test_optim.py -k test_can_load_older_state_dict

# newly added test (follow 6b1f13ea2f/test/test_cuda.py (L1108))
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
```

**Benchmark**:
**5.1x** on 56 core SPR
**Parameter-size=1M**
**Nparams=10**
[test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7)

```
numactl -C 0-55 -m 0 python bench_adam.py
non-fused 6.0174267292022705 s
fused 1.1787631511688232 s
```

**Note: Fused kernel accuracy**
The accuracy failure in CI shows a little higher than default tolerance
```
2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%)
2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed)
2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed)
```
I have debug it step by step and unfortunately we may not able to make the `fused kernel` exactly same with `non fused` one due to compiler optimizations.
For example, in non-fused impl
```
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
and in fused impl
```
  exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d];
  //  std::cout << "exp_avg_sq " <<   exp_avg_sq_ptr[d] << std::endl;
  exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] +
      scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val;
```
If I keep `std::cout`, I can get exactly same results in UT
```
===============param
0.6796758770942688
0.6796758770942688
```
But when I comment out it, there will be a difference
```
===============param
0.6796758770942688
0.6796759366989136
```
So I will make the tolerance a little higher than default one.

Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-04-19 11:14:04 +00:00
9a71d12d92 [CUDAGraphTree] Support mutated inputs from prior cudagraph pool (#123231)
# PR
This PR supports mutating inputs in cudagraph trees, if these inputs are outputs from previous cudagraph. Please check #121861 for more details.

# Note on Optimistic Mutation Check
To determine whether applying cudagraph, we need to check input mutations, falling into four categories: a) no mutation, b) mutation on parameters/buffers, c) mutation on cudagraph recorded tensors, d) mutation on non-cudagraph recorded tensors. We can apply cudagraph for type a,b,c but cannot for type d. This input mutation types depends on function, current_node, and inputs.

Since `check_for_mutation` is slow, there is a trade-off on making type c or d faster.
- To make type d) faster, we want to `check_for_mutation` and call eager function early. However, this adds unnecessary overhead to type a, b, c due to the extra check.
- To make type c) faster, we want to skip `check_for_mutation` at the beginning and only `check_for_mutation` before `record_function` for a new function. This removes the overhead of `check_for_mutation` for type a, b, c. However, this adds extra overhead to type d due to `check_invariants` for all children nodes.

Instead, we design optimistic mutation check. The assumption is that, given a function and a node, the input mutation types usually remain the same across inputs. So, if we have ever detect a function on a node with type d, we will never detect it as type c. The detailed design is:
- [Slow Path] On the first invocation of a function on a node, we run `check_for_mutation` once and cache the input mutation type as `non_cudagraph_managed_mutation[node_id][func_id]`.
- [Fast Path] On the subsequent invocations of a function on a node, we skip `check_for_mutation`. For `non_cudagraph_managed_mutation[node_id][func_id]` as true, we directly call eager function. Otherwise, we `check_variants` and call cudagraph function.
- [Slow Path] Before `record_function`, we run `check_for_mutation` again.

**Q1: Would there be overhead for type a,b,c,d?**
A: No. We only check input mutation types for the first invocation of a function on a node.

**Q2: If a function happens to be type c during the first invocation on a node, could we detect it as type d in the future?**
A: Yes. This is done by `check_invariants` and guarantees the correctness.

**Q3: If a function happens to be type d during the first invocation on a node, could it still be recognized as type c in the future?**
A: No. But this should happen rarely according to our assumption. In the rare case that it happens, there would not be any correctness issues and the performance is the same as the eager (or inductor optimized) function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123231
Approved by: https://github.com/eellison
2024-04-19 10:32:12 +00:00
58e403c739 Added a docstring for torch.Size.numel. (#124186)
Fixes #61231. Fixes #124167.

This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186
Approved by: https://github.com/janeyx99
2024-04-19 09:23:02 +00:00
520bc1080e Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)"
This reverts commit 768ce2cddad2057349d1194274a5f93c47c5ac88.

Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))
2024-04-19 09:09:03 +00:00
b2f6cfd9c0 Fix AVX2 int4pack_mm_kernel crash if weighs are unaligned (#124433)
Followup after https://github.com/pytorch/pytorch/pull/124128
`s/_mm256_load_si128/_mm256_loadu_si128/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124433
Approved by: https://github.com/desertfire
2024-04-19 05:17:38 +00:00
a6f044a490 [dynamo, 3.8-3.9] support dataclass with frozen=True in Python 3.8/3.9 (#124393)
Closes #114966

Frozen field assignment in `__init__` in Python 3.8-3.9:

f5bd65ed37/Lib/dataclasses.py (L402-L411)

```python
import builtins

BUILTINS = builtins

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'BUILTINS.object.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Frozen field assignment in `__init__` in Python 3.10+:

812245ecce/Lib/dataclasses.py (L436-L445)

```python
__dataclass_builtins_object__ = object

def _field_assign(frozen, name, value, self_name):
    # If we're a frozen class, then assign to our fields in __init__
    # via object.__setattr__.  Otherwise, just use a simple
    # assignment.
    #
    # self_name is what "self" is called in this function: don't
    # hard-code "self", since that might be a field name.
    if frozen:
        return f'__dataclass_builtins_object__.__setattr__({self_name},{name!r},{value})'
    return f'{self_name}.{name}={value}'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124393
Approved by: https://github.com/jansel
2024-04-19 05:10:33 +00:00
e408d9ca25 Revert "Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC (#123721)"
This reverts commit d032a780080646828bdda15f3af0277288b2fa34.

Reverted https://github.com/pytorch/pytorch/pull/123721 on behalf of https://github.com/malfet due to ARC is too flaky ([comment](https://github.com/pytorch/pytorch/pull/123721#issuecomment-2065750954))
2024-04-19 04:51:35 +00:00
96a067b190 Revert "Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC (#123722)"
This reverts commit b5d4ebe9aeabc1fc46ca39dee2d446f9b5e9e114.

Reverted https://github.com/pytorch/pytorch/pull/123722 on behalf of https://github.com/malfet due to ARC is too flaky ([comment](https://github.com/pytorch/pytorch/pull/123722#issuecomment-2065749522))
2024-04-19 04:49:07 +00:00
1ba85b34dd [AOTI] Enbale mmaped weights when CUDA is used (#124346)
By refactoring the logic that returns the start to constant pointer into `_get_constants_start()` method and call it from both CUDA and CPU readers

It has no runtime impact, but export time is down from 10m to 3m if mmaped weights are used on AWS p4d.24xlarge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2024-04-19 04:47:27 +00:00
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo becuase `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
768ce2cdda [Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247)
This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing.

#suppress-api-compatibility-check

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-04-19 03:31:13 +00:00
803a08f8ae [ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030)
This PR is to add cublasGemmAlgo_t -> hipblasGemmAlgo_t to cuda_to_hip_mappings.py.
It is required for DeepSpeed transformer extension build on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-04-19 02:57:16 +00:00
290e3e7abb Add ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408)
Summary: The intent is that we can whitelist certain benchmarks to a) enable TORCH_COMPILE_DEBUG=1, and b) save the generated artifacts in test/debug in case of a failure. Via the rules in action.yml, we can then upload test/debug/ to S3 whenever it exists. I chose to introduce a new directory (test/debug/) rather than using an existing one (e.g., test/test-reports/), because these don't seem like test reports and we can later add other debug-related artifacts if we find it useful. For example, we might want to later explore including the inductor cache artifacts.

Test Plan:
See artifacts generated when I force a failure: https://hud.pytorch.org/pr/124234
Specifically: https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/8729891826/1/artifact/debug-test-inductor_torchbench-2-2-linux.g5.4xlarge.nvidia.gpu_23953679574.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124408
Approved by: https://github.com/desertfire
2024-04-19 02:46:00 +00:00
889e3eeed3 Avoid cuda init to FakeTensorMode (#124413)
Also partially fixes #122109

This PR:
- We add a C++ flag (only_lift_cpu_tensors) to toggle the
  torch.tensor(1, device='cuda') ctor strategy.
  When false (default), it does the current PyTorch behavior
  of unconditionally constructing a concrete CUDA tensor then calling
  lift_fresh on it. When true, we instead construct a concrete CPU
  tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient
  modes).
- FakeTensorMode flips this flag depending on if CUDA is available or
  not. We don't unconditionally set the flag to True because that is
  likely BC-breaking.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413
Approved by: https://github.com/eellison
2024-04-19 02:39:35 +00:00
e620c3e814 Optimized templated attention to use exp2 (#124356)
0.705 (vs. FA2) to 0.860 after this change.

<img width="1270" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/d58f57ba-e50e-44ea-8a8a-4f13b8650adf">

to

<img width="1277" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f1945b67-0cfc-463c-a2f6-5812b90677fe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124356
Approved by: https://github.com/drisspg
2024-04-19 01:58:19 +00:00
eqy
0bde4efa84 Fix broken test in test_aot_inductor.py (#124329)
Doesn't seem to run in upstream CI due to sm90 requirement but it is failing on our end due to the extra positional argument

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124329
Approved by: https://github.com/chenyang78
2024-04-19 01:54:30 +00:00
0affd23014 Enable UFMT on test/test_python_dispatch.py (#124373)
Part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124373
Approved by: https://github.com/ezyang
2024-04-19 00:57:18 +00:00
ddd0ed1b43 distributed: templated ring attention (#124215)
This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.

This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way.

Misc changes:

* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)

Test plan:

```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
2024-04-19 00:57:08 +00:00
4946638f06 [AOTI] Add ABI-compatiblity tests (#123848)
Summary: In AOTInductor generated CPU model code, there can be direct references to some aten/c10 utility functions and data structures, e.g. at::vec and c10::Half. These are performance critical and thus it doesn't make sense to create C shim for them. Instead, we make sure they are implemented in a header-only way, and use this set of tests to guard future changes.

There are more header files to be updated, but we will do it in other followup PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123848
Approved by: https://github.com/jansel
ghstack dependencies: #123847
2024-04-19 00:51:24 +00:00
cbefaf2a37 [AOTI] Move c10/util ostream function implementations to their headers (#123847)
Summary: AOTInductor generated code for CPU models may have direct reference to these c10-implemented data types, see _inductor/codegen/cpp_prefix.h. To make sure the AOTI generated code is ABI backward compatible, we need to change those headers to a header-only implementation. The next PR in this stack will add tests to use those data types without linking against libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123847
Approved by: https://github.com/jansel
2024-04-19 00:51:24 +00:00
9ed9b22ec0 Implement efficient_conv_bn_eval_decomp_graph_transform to handle conv and bn fusion after decomp (#123680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123680
Approved by: https://github.com/ezyang, https://github.com/youkaichao
2024-04-19 00:22:25 +00:00
ca6a0e1348 [c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334)
Summary:
This ENV was introduced to safely rollout the behavior change in destroy
process group (e.g., call ncclCommsAbort). Now that this behavior change
were already rolled out, we no longer need this env and we should clean
up it to keep our code cleaner
Test Plan:
Modified/existing ut pass

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
2024-04-18 23:42:55 +00:00
2f45be46f6 [DeviceMesh][Test] Add 3d unit test for get_local_rank() (#124142)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142
Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu
2024-04-18 23:19:17 +00:00
e0792cf3d6 Make copy_cast, softmax and cat_out unranked (#123191)
Fixes #ISSUE_NUMBER
This helps with the performance as it removes multiple copies of the graphs saved due to their shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123191
Approved by: https://github.com/DenisVieriu97
2024-04-18 23:14:55 +00:00
e4f6340f21 realize inputs to mem bound mm decomposition (#123165)
Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165
Approved by: https://github.com/jackiexu1992
2024-04-18 23:10:04 +00:00
5ba6bb7b2f Add swap_tensors path to nn parametrizations (#124130)
Fixes #123859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130
Approved by: https://github.com/albanD
2024-04-18 22:22:08 +00:00
87f651c7e7 fix cpu test errors (#124116)
Similar fix is from @int3 but not landed. Credit to @int3 too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124116
Approved by: https://github.com/chenyang78
2024-04-18 20:30:58 +00:00
2e48b39603 Fix example_value of map (#124203)
Previously, we didn't expand the shape of example_value of map to the same as inputs (edit: the first mapped dimension). This pr fixes this bug. To make this easier, we change _call_function_and_unflatten_output to accept example_values directly instead of retrieving them from the variable trackers.

Also remove a redundant call function node in strict_mode higher order op in dynamo.

Test Plan:
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-04-18 19:18:36 +00:00
4a0900d04b Revert "[NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)"
This reverts commit ef93402f619f58d651845981ccd1eba1d68da077.

Reverted https://github.com/pytorch/pytorch/pull/124343 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124343#issuecomment-2064937192))
2024-04-18 18:55:48 +00:00
61bc188f42 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit b51f66c1950a582dd18d1b2ee67df840a8c4dbbe.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/malfet due to Broke gcc9 builds ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2064936414))
2024-04-18 18:53:59 +00:00
89407eca3b Capture triton kernel in execution trace (#124140)
Summary: This DIFF is to capture triton kernels in execution trace.

Test Plan: buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D56162599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140
Approved by: https://github.com/briancoutinho
2024-04-18 18:38:26 +00:00
74bedbb9e1 [export] Serialize rational symint ranges (#123884)
Some symints result in rational ranges like 10/3 which runs into an error ([example](https://www.internalfb.com/intern/everpaste/?handle=GMG2AxkeoFUrh-UDAFcE8pKPgjoUbsIXAAAB)).

Ed will eventually get rid(?) of these rational ranges but as a workaround export can just clamp the results during serialization time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123884
Approved by: https://github.com/zhxchen17
2024-04-18 18:20:11 +00:00
b6f0159db0 Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)
Test the generic torch.Stream/Event with fake device gurad and hooks.
@exported-using-ghexport

Differential Revision: [D55902506](https://our.internmc.facebook.com/intern/diff/D55902506/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123614
Approved by: https://github.com/albanD
ghstack dependencies: #123611, #123612
2024-04-18 17:40:13 +00:00
37215a4fa2 Fix memory leak in pattern_matcher (#124345)
#121313 changed precompiled patterns so they are more integrated with the pattern matching code.  This resulted with a list of "known" patterns (with their example data) being stored globally. Unfortunately since small FakeTensors store a constant of the original tensor it meant that we leaked cuda tensors in the example data.

Fix this by clearing out the constant storage for the example data that we keep around.

Fixes #124081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
2024-04-18 17:38:12 +00:00
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
MTIA device has its own Module in PyTorch now.
torch.mtia has following APIs similar to other backends. The lazy_init is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int : ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00
cb17721899 Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)
This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int

    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def query(self) -> None: ...
    def synchronize(self) -> None: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
------------------
**torch.Event APIs**:
- IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream.
- currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag.
- elapsedTime API is added to c10::Event

```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):

    device: _device  # The device of the Event
    event_id: _int # The raw event created by device backend

    def __new__(self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False) -> Event: ...
    @classmethod
    def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```

-----------

c10::Event provides new APIs
- calculate **elapsedTime**.
- Get raw event id
- Synchronize event.

```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: need to find a good way to test them in PyTorch with API mocks.

Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
2024-04-18 17:35:09 +00:00
7a6edb0b66 Possible fix for einops warning (#124084)
See https://github.com/arogozhnikov/einops/issues/315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124084
Approved by: https://github.com/peterbell10
2024-04-18 17:09:50 +00:00
a8cf91c395 Fix predispatch tracing for aten::lift_fresh_copy (#124198)
Differential Revision: D56200666

Previously, when we hit the Functionalize kernel for lift_fresh_copy, we directly dispatch self.clone() to proxy dispatch. As a result, we end up receiving a functional tensor at proxy dispatch. As a work around, I unwrap self manually. Not sure, why it works ok in aot-dispatch tho

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124198
Approved by: https://github.com/bdhirsh
2024-04-18 17:02:38 +00:00
e1062f5738 [export] Add a printer to unflattened module. (#124315)
Summary: add a helper method to print graph in every level of unflattened module.

Test Plan: {F1489609684}

Differential Revision: D56263195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124315
Approved by: https://github.com/tugsbayasgalan
2024-04-18 16:35:51 +00:00
415a8f6398 Fixed issue in affine_grid_backward when grad_grid is non-contiguous (#124370)
Description:
- replaced .view with .reshape to fix the problem when grad_grid is channels last 2d/3d
- added a consistency test

Fixes #124154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124370
Approved by: https://github.com/lezcano
2024-04-18 16:30:10 +00:00
aa2da0cdd2 [Export] Add runtime assert to non-strict export (#123681)
This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export.

Differential Revision: D55944267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2024-04-18 16:13:27 +00:00
5677128cb8 [MPS] Fix crash with binary_cross_entropy is invoked for half dtypes (#124258)
By creating constants using input tensors dtype

One line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```

Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```

Fixes https://github.com/pytorch/pytorch/issues/124252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
2024-04-18 15:21:01 +00:00
ef93402f61 [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-18 14:42:54 +00:00
bbb6e36495 [FSDP2] Fixed set_requires_gradient_sync's recurse arg (#124318)
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.

The previous unit test did not have nested FSDP modules with managed parameters, so the `recurse=False` was not being exercised. We augment the unit test to try only disabling gradient sync for the root module and not children.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
2024-04-18 14:21:57 +00:00
9385ef2a5d Revert "Skip workspace permission change for ROCm CI (#123816)"
This reverts commit 4322a0e782119f870ba1a17aec2be8a0ef1103d7.

Reverted https://github.com/pytorch/pytorch/pull/123816 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123816#issuecomment-2063949316))
2024-04-18 14:07:09 +00:00
1325fd94a4 Support xpu autocast policy (#124052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124052
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-04-18 14:06:48 +00:00
cyy
b51f66c195 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-18 13:35:48 +00:00
1542874311 Delete qualname from custom_op decorator (#124092)
I forgot to delete this in an earlier PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124092
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071, #124089
2024-04-18 12:48:04 +00:00
648c39c47d Add OpOverload.redispatch; use it in new custom ops API (#124089)
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
  Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch

We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can actually call
redispatch like how pytorch built-in ops do it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
2024-04-18 12:48:04 +00:00
645173a0b5 Add torch.library.register_autograd (#124071)
Allows registering autograd for all custom op entry points:
- the new-style custom op API (custom_op)
- the old-style torch.library APIs
- C++ operator registration

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066
2024-04-18 12:47:59 +00:00
8135c4b921 torch.library.register_fake now accepts more types (#124066)
We allow it to accept:
- a string with the op name
- an opoverload
- a new-style custom op

If any of these are referring to a new-style custom op (created with the
custom_op decorator), then we dispatch to CustomOpDef.register_fake.
Otherwise, we do what we previously did.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
2024-04-18 12:47:55 +00:00
a0466061e1 Support xpu host allocator (#123080)
# Motivation
This PR mainly covers caching host allocator supported on xpu backend.

# Solution
`XPUCachingHostAllocator` adopts the **same** caching mechanism as cuda via two abstract interfaces -`CachingHostAllocatorImpl` and `CachingHostAllocatorInterface`.

# Additional Context
Following CUDA, this PR adds a new API `getPinnedMemoryAllocator` to support the tensor's memory pinned.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123080
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
2024-04-18 12:29:21 +00:00
b5d4ebe9ae Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC (#123722)
Migrate linux-focal-cuda12_1-py3_10-gcc9-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123722
Approved by: https://github.com/jeanschmidt
2024-04-18 12:06:57 +00:00
d032a78008 Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC (#123721)
Migrate linux-focal-cuda11_8-py3_10-gcc9-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123721
Approved by: https://github.com/jeanschmidt
2024-04-18 12:06:28 +00:00
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
d59f1da62f [sym_shapes][perf] _find not update unchanged replacements (#124274)
Differential Revision: [D56236380](https://our.internmc.facebook.com/intern/diff/D56236380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124274
Approved by: https://github.com/ezyang
2024-04-18 08:32:02 +00:00
9eba1995d0 [sym_shapes][perf] Use sympy xreplace instead of subs (#124208)
https://github.com/sympy/sympy/issues/22240

Differential Revision: [D56207553](https://our.internmc.facebook.com/intern/diff/D56207553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124208
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-18 08:19:03 +00:00
2b82345e48 Revert "Re-land precompile triton templates (#124030)"
This reverts commit 030bb13fe84c88ab5c988351543362b60fefb556.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))
2024-04-18 07:21:41 +00:00
704fac5618 [dynamo][cpp-guard] Reland Attempt 1 - Enable cpp guard manager (#124231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124231
Approved by: https://github.com/jansel
ghstack dependencies: #124230, #124237
2024-04-18 06:36:20 +00:00
6e86a40694 Revert "[Dynamo] Check for __bool__ attribute before accessing it (#120943)"
This reverts commit dd7aeedb72f8a96d0f168308292e0d41c095f01b.

Reverted https://github.com/pytorch/pytorch/pull/120943 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/120943#issuecomment-2063098295))
2024-04-18 06:34:32 +00:00
8ff85b42f9 Revert "Add swap_tensors path to nn parametrizations (#124130)"
This reverts commit 64f6ddf12c11738c3f4b1ed01cf4f699541496bf.

Reverted https://github.com/pytorch/pytorch/pull/124130 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124130#issuecomment-2063074856))
2024-04-18 06:12:54 +00:00
ec608a5d66 Refactor CUDA's amp cast policy to be generic (#124051)
# Motivation
This PR intends to create several op lists for different policies:
- `AT_FORALL_LOWER_PRECISION_FP` for policy `lower_precision_fp`
- `AT_FORALL_FP32` for policy `fp32`
- `AT_FORALL_FP32_SET_OPT_DTYPE` for policy `fp32_set_opt_dtype`
- `AT_FORALL_PROMOTE` for policy `promote`.

To make sure the other backend can reuse the policy op list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124051
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #124050
2024-04-18 04:35:25 +00:00
8ad66e05d2 [4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123)
Summary: as title

Test Plan: CI & unit test

Differential Revision: D56163334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-04-18 04:19:37 +00:00
de1c0d2497 [cublas] Keep explicit workspace creation to avoid OOM (#124250)
Summary:
We explicitly set the cublas workspace even though CUDA 12.2+ fixed the issue where memory usage increased during graph capture. Original issue: https://github.com/pytorch/pytorch/pull/83461

This is because in CUDA 12.2+, the use of cudaMallocAsync in cublas will allocate memory dynamically (even if they're cheap) outside PyTorch's CUDA caching allocator. It's possible that CCA used up all the memory and cublas's cudaMallocAsync will return OOM

Test Plan: CI

Differential Revision: D56226746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124250
Approved by: https://github.com/houseroad, https://github.com/eqy
2024-04-18 04:17:38 +00:00
c9ab9248ce [Inductor Intel GPU backend Upstream] Generalize device-bias code in (#124249)
Generalize device-bias code in tirton_utils.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124249
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel
2024-04-18 03:54:31 +00:00
27daa110c8 Back out "Refresh OpOverloadPacket if a new OpOverload gets added (#123578)" (#124324)
Summary:
Original commit changeset: 528276bc8a92

Original Phabricator Diff: D56057952

Differential Revision: D56271240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124324
Approved by: https://github.com/davidberard98
2024-04-18 03:33:54 +00:00
f213f262af [dynamo][cpp-guards] Improve when to use Dict vs DictSubclassGuardManager (#124237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124237
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #124230
2024-04-18 03:33:37 +00:00
9fed2e826b [DTensor][Test] Add unit tests to keep track of DTensor sharding for 2D (#123687)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123687
Approved by: https://github.com/wanchaol
2024-04-18 03:29:16 +00:00
dca24d70ba [dynamo, test] remove skip for unhandled exception test (#123876)
This test might no longer segfault in CI due to changes to how we allocate and free shadow frames in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123876
Approved by: https://github.com/jansel
2024-04-18 03:02:34 +00:00
812bae09be [dynamo] fix 3.11+ refleak (#124238)
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.

In 3.11+, `_PyFrame_FastToLocalsWithError` could implicity run `COPY_FREE_VARS` on the original frame, leading to double incref's since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.

Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
2024-04-18 03:02:29 +00:00
7c94652d7d [benchmarks] Add --use-warm-peak-memory (#124326)
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so I will only add it to unblock compiled backward benchmarking (where all autotuning memory use is exposed)

```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29

# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```

![image](https://github.com/pytorch/pytorch/assets/9547562/36cd8687-a7f7-4ec6-b989-7e1263aa7d37)

This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory:
![image](https://github.com/pytorch/pytorch/assets/9547562/4e544b11-3579-4c69-811a-91d896f1ba66)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411
2024-04-18 02:57:01 +00:00
0ddd17bdc6 [benchmarks] Add --snapshot-memory to get memory pickles for eager vs compiled (#119411)
creates memory snapshot pickles e.g.
```
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_compiled_pytorch_stargan.pickle
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_eager_pytorch_stargan.pickle
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119411
Approved by: https://github.com/jansel
2024-04-18 02:57:01 +00:00
6b4b857a60 [dynamo][nn_module] Enable torch.compile/disable as decorators on the class (#124187)
Support something like. This is UI change, so please review carefully.

~~~
        @torch._dynamo.disable
        class SimpleLinear(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.layer0 = torch.nn.Linear(4, 4)

            def forward(self, inp):
                return self.layer0(torch.sigmoid(inp))

        @torch.compile(backend=cnts)
        class SimpleModel(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.layer0 = SimpleLinear()
                self.layer1 = torch.nn.Linear(4, 4)

            def forward(self, inp):
                z = self.layer0(torch.sin(inp))
                return self.layer1(z)
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124187
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-18 02:51:30 +00:00
b6b757701e [aot] trim refcount for subclass runtime wrapper (#124155)
On torchtrain,

before
<img width="1218" alt="image" src="https://github.com/pytorch/pytorch/assets/9547562/b340c114-071a-440c-904c-c042de4d92c5">

after
![image](https://github.com/pytorch/pytorch/assets/9547562/ee3b6e6f-6e46-46bc-a93d-d4603673ee63)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124155
Approved by: https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #124127
2024-04-18 02:34:52 +00:00
1f04c29be5 [inductor] Freeze the layout of the conv input to channels_last (#122765)
Fix https://github.com/pytorch/pytorch/issues/118082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122765
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-04-18 02:23:38 +00:00
51a56efbb9 [inductor] modify the output_stride of ConcatKernel (#122761)
Fix https://github.com/pytorch/pytorch/issues/121613.
Modify the `output_stride` of `ConcatKernel`: If any input to `Concat` is `Pointwise`, check the layout of all inputs to `Pointwise`, if any of the inputs is in channels_last format, set channels_last strides for the `output_stride`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122761
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-04-18 02:19:46 +00:00
78f3b99a94 [inductor] Modify the rules for freezing the layout of x.unwrap_view() in convert_to_reinterpret_view (#122760)
Fix https://github.com/pytorch/pytorch/issues/121607

Modify the rules for freezing the layout of `x.unwrap_view()` in `convert_to_reinterpret_view`: If any read of `x.unwrap_view()` is in channels_last format, freeze the layout of `x.unwrap_view()` to channels_last format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122760
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-04-18 02:17:07 +00:00
b71423c2e4 [inductor] let coordesc tuner respect max RBLOCK (#124325)
Fix https://github.com/pytorch/pytorch/issues/124251 .

Coordesc tuner need respect max RBLOCK. When rnumel is a multiple of max-RBLOCK, inductor codegen will skip rmask. If coordesc tuner does not consider max-RBLOCK and pick a RBLOCK larger than that, we would get CUDA IMA (illegal memory access) error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124325
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-04-18 02:12:35 +00:00
43b4ac956e Add index_reduce decomposition (#122579)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122579
Approved by: https://github.com/peterbell10
ghstack dependencies: #123375
2024-04-18 01:30:47 +00:00
030bb13fe8 Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-18 01:22:13 +00:00
fae31495ff Try to speed up lintrunner in CI (#124311)
Before timing: clang is 19min and noclang is 16min
After timing: clang is 17min and noclang is 15min

This is still crazy slow so most likely more could be done but didn't check the logs in details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124311
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-04-18 01:17:47 +00:00
cc66c43d51 Make macro with AMP more generic (#124050)
# Motivation
According to [[RFC] Intel GPU Upstreaming](https://github.com/pytorch/pytorch/issues/114723), we would like to upstream amp autocast policy to facilitate the functionality and accuracy of `torch.compile` on e2e benchmarks.

# Solution
The first PR aims to make macro `KERNEL` to be generic. It accepts two types of inputs, like `(DISPATCH, OP, POLICY)` and `(DISPATCH, OP, OVERLOAD, POLICY)`.
The second PR intends to refactor CUDA's autocast policy to make it can be shared with `XPU` backend.
The final PR would like to support XPU autocast policy which shares the same recipe with `CUDA` backend.

# Additional Context
Another motivation is we would like to unify autocast API and provide the generic APIs, like:
- `torch.get_autocast_dtype(device_type)`
- `torch.set_autocast_dtype(device_type)`
- `torch.is_autocast_enabled(device_type)`
- `torch.set_autocast_enabled(device_type)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124050
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-04-18 01:15:03 +00:00
102a223216 Enable dynamo test_state_dict_deterministic (#123323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123323
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498, #123322
2024-04-18 01:06:28 +00:00
d88fcb86d8 Enable dynamo traced test_forloop_goes_right_direction (#123322)
Removed a bunch of skips, I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
2024-04-18 00:50:10 +00:00
57a3dc56d4 Small Adamax fix (#123498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123498
Approved by: https://github.com/janeyx99
2024-04-18 00:50:03 +00:00
21f7cbdc1c Enable UFMT on test/test_autograd.py (#124141)
Part of: #123062

Ran lintrunner on:

- `test/test_autograd.py`

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124141
Approved by: https://github.com/soulitzer
2024-04-18 00:16:23 +00:00
025387f4dd [ez][CI] Reduce CI_SERIAL_LIST pt2 (#124298)
#124085

Add @serialTest() to some tests

slow gradcheck already runs serially

Doing this slowly so its easier to check flaky issues that might get made

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124298
Approved by: https://github.com/kit1980
2024-04-18 00:13:36 +00:00
38bfe7bcd1 add link to custom ops troubleshooting page on tensor data_ptr error (#124240)
Fix part of https://github.com/pytorch/pytorch/issues/123603.

Example traceback on branch https://github.com/pytorch/vision/compare/main...wwen/custom_ops_test:
```
running my_custom_op!
Traceback (most recent call last):
  File "/data/users/williamwen/torchvision/playground.py", line 13, in <module>
    print(opt_fn1(torch.randn(3, 3)))
  File "/data/users/williamwen/pytorch2/torch/_dynamo/eval_frame.py", line 387, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 977, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 818, in _convert_frame
    result = inner_convert(
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert
    return _compile(
  File "/data/users/williamwen/pytorch2/torch/_utils_internal.py", line 70, in wrapper_function
    return function(*args, **kwargs)
  File "/data/users/williamwen/py310-env/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 700, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 266, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 568, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 173, in _fn
    return fn(*args, **kwargs)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 515, in transform
    tracer.run()
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 2237, in run
    super().run()
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 875, in run
    while self.step():
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 790, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 492, in wrapper
    return inner_fn(self, inst)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 730, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/torch.py", line 747, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1425, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1510, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1804, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1736, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1251, in wrap_fake_exception
    return fn()
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1737, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1872, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1854, in run_node
    return node.target(*args, **kwargs)
  File "/data/users/williamwen/pytorch2/torch/_ops.py", line 870, in __call__
    return self_._op(*args, **(kwargs or {}))
torch._dynamo.exc.TorchRuntimeError: Failed running call_function torchvision.my_custom_op1(*(FakeTensor(..., size=(3, 3)),), **{}):
The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://docs.google.com/document/d/1W--T6wz8IY8fOI0Vm8BF44PdBgs283QvpelJZWieQWQ

from user code:
   File "/data/users/williamwen/torchvision/playground.py", line 5, in fn1
    return torch.ops.torchvision.my_custom_op1(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124240
Approved by: https://github.com/zou3519
2024-04-18 00:08:09 +00:00
5a60a1abde Move the implementation of register_fake onto torch.library.Library (#124065)
Motivations:
- This makes things more consistent: using a Library object, you should
  be able to do all of the registration APIs that tie registrations to
  the lifetime of the Library.
- I need this for the next PR up in the stack, where we will have
  torch.library.register_fake support both CustomOpDef (from the new
  custom ops API) and other custom ops.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
2024-04-17 23:51:20 +00:00
d1e1d671ef Stop requiring a pystub for register_fake by default (#124064)
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.

This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
2024-04-17 23:51:20 +00:00
f5049de242 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit 5bef127c2ea49280e7fda4f9fa7cad6fa4078e7d.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/PaliC due to your using TORCH_INTERNAL_ASSERT incorrectly ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2062696010))
2024-04-17 23:44:00 +00:00
64f6ddf12c Add swap_tensors path to nn parametrizations (#124130)
Fixes #123859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124130
Approved by: https://github.com/albanD
2024-04-17 23:37:28 +00:00
b5235694f4 [FSDP2] Made unshard return type consistent (#124293)
We can always return an `UnshardHandle` if `async_op=True` even if the FSDP module does not manage any parameters and hence does not have an `FSDPParamGroup`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124293
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952
2024-04-17 23:33:46 +00:00
64f42bfd52 [dynamo] Support list.reverse (#124210)
fixes #123974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124210
Approved by: https://github.com/peterbell10
2024-04-17 23:33:32 +00:00
dd7aeedb72 [Dynamo] Check for __bool__ attribute before accessing it (#120943)
This PR checks if __bool__ attribute is available before accessing it when handling a UserDefinedObjectVariable

Fixes #119782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120943
Approved by: https://github.com/zou3519
2024-04-17 23:26:55 +00:00
00372b1211 Extend int[48]mm ops to float32 input (#124287)
Just for completeness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124287
Approved by: https://github.com/mikekgfb
2024-04-17 23:10:49 +00:00
14162eecfc Update Security Policy to provide Security Guidance for users (#120531)
Fixes #120530

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120531
Approved by: https://github.com/malfet, https://github.com/albanD
2024-04-17 23:08:48 +00:00
9875a834e4 [Intel GPU] oneDNN GPU GEMM support (#117202)
# Motivation

This PR is a part of RFC #114848, and it  is a successor PR of #116249 and #116019. This PR would depend on oneDNN compilation in #116249. Some runtime support is needed in #116019.

Aten operators like `addmm`, `baddmm` is defined in `Blas.cpp` in `aten/src/ATen/native/mkldnn/xpu/`.

Accompanied with these files provide core functionaliy, `BlasImpl.h`, `Utils.h` and other file provide basic utilities for them. For instance, `Utils.h` provide common memory descriptor query utils for `Matmul.h` and these utility function will also be used in other primitive, like `convolution`.  `BlasImpl.h` is a header file that provide helper for handling shape info processing in matmul related operators. It would not only help basic GEMM operator like `addmm, baddmm` but also help fusion operators used in `torch.compile` like `linear_pointwise` in #117824.

In next stage, we would continually complete the oneDNN support through enabling  `matmul fusion`  and `convolution` related code.

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117202
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117098, #117112
2024-04-17 23:06:38 +00:00
6330acae76 Refactored implementation for upsample_nearest decompostions (#122783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122783
Approved by: https://github.com/peterbell10
2024-04-17 23:05:40 +00:00
bebdbb63ce Introduce set_example_value and use it throughout Dynamo (#124176)
I'm going to setup some extra behavior when we set example value, so
I need a convenient place to interpose.  I cannot easily do it on
meta itself because its a generic dict with no interposition point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
2024-04-17 22:57:11 +00:00
d23bf9cef0 Add fake impl for aten.unique2 (#124306)
Reapply of: https://github.com/pytorch/pytorch/pull/121571
Differential Revision: [D56258431](https://our.internmc.facebook.com/intern/diff/D56258431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124306
Approved by: https://github.com/gmagogsfm
2024-04-17 22:55:27 +00:00
cc18afa25f Intel GPU oneDNN upstreaming for primitive integration (#117112)
# Motivation

As proposed in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, oneDNN library is an important component for Intel GPU software ecosystem.

Current PR is based on #117098, where oneDNN library for Intel GPU should be ready.  This PR is the integration code from aten to oneDNN. GEMM integration code is the core part in this PR. Accompanied with GEMM, more basic support like runtime (device, stream), primitive attr is also included.

We put the oneDNN integration code in directory `aten/src/ATen/native/mkldnn/xpu/detail`. We add a namespace `at::native::xpu::onednn` for oneDNN integration.

The code in this PR would be used in following PRs, where aten operators would call the functions in these integration code.. We separate the prs due to onednn integration is logically separable with aten operator implementation. Also, this can ease the burden of reviewing by avoid too much codes in single PR.

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117112
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-04-17 22:49:56 +00:00
944d046645 Revert "[DeviceMesh][Test] Add 3d unit test for get_local_rank() (#124142)"
This reverts commit a403757913689d200683a4158c565bc3dbade74b.

Reverted https://github.com/pytorch/pytorch/pull/124142 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/124142#issuecomment-2062587289))
2024-04-17 22:31:30 +00:00
1ec05c769b all_gather and reduce_scatter autograd (#123989)
This adds `all_gather_tensor_autograd` and `reduce_scatter_tensor_autograd` to the functional_collectives library.

This only supports `sum` mode for `reduce_scatter` but should be easy to extend in the future.

The backwards implementations match the behavior in https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

This follows the pattern of #123599 .

Test plan:

```sh
pytest test/distributed/test_functional_api.py -k Autograd
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123989
Approved by: https://github.com/wanchaol
2024-04-17 21:32:22 +00:00
a403757913 [DeviceMesh][Test] Add 3d unit test for get_local_rank() (#124142)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124142
Approved by: https://github.com/xunnanxu, https://github.com/fegin, https://github.com/XilunWu
2024-04-17 20:45:49 +00:00
cdc855af97 [Test][2D] Turn on 2D state_dict tests for uneven sharding (#124255)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124255
Approved by: https://github.com/wanchaol
2024-04-17 20:45:34 +00:00
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
eqy
b726a23d4e change tf32 thresholds for test_per_sample_grads_embeddingnet (#124104)
TF32 causes issues with the tolerances here; we might also consider migrating some of the `with_tf32_off` tests in this file to `tf32_on_and_off` in case it would be useful to get signal for TF32.

CC @malfet @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124104
Approved by: https://github.com/zou3519
2024-04-17 19:16:32 +00:00
4efdf9a6a6 fix pytorch version for onnx in doc (#124182)
Fixes [ 123845](https://github.com/pytorch/pytorch/issues/123845)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124182
Approved by: https://github.com/albanD
2024-04-17 18:05:15 +00:00
24cecf06d7 Update autotune jk knobs (#124214)
Differential Revision: [D56201145](https://our.internmc.facebook.com/intern/diff/D56201145/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124214
Approved by: https://github.com/aakhundov
2024-04-17 17:49:25 +00:00
f433517181 [dynamo][decorator] Support disable on nn modules (#124185)
Fixes https://github.com/pytorch/pytorch/issues/123979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124185
Approved by: https://github.com/weifengpy, https://github.com/yoyoyocmu
2024-04-17 16:20:34 +00:00
46324fe073 Speedup int4mm_kernel with NEON (#124257)
By unrolling middle loop by 16 elements and using neon to decode packed int4 to float32.
  Unrolling entire `n` loop actually makes it a tad slower, probably because ARM has smaller register file that x86
  Before/after performance running stories110M on M2Pro

 | eager (before) | eager (after) | compile(before) | compile (after) |
 | ---- | --- | -- | -- |
 | 28 | 57  | 31 | 104 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124257
Approved by: https://github.com/mikekgfb
2024-04-17 16:04:25 +00:00
9b1d6c8d98 improve F.adaptive_avg_pool2d error messages on mps (#124143)
Gives better error messages on mps. Partially fixes #123725 in the case of `F.adaptive_avg_pool2d`. This also relates to #96056.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124143
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-17 16:04:09 +00:00
7e1c98c171 [dynamo] support object.__setattr__(obj, name, value) (#124068)
Resolves #114964
Resolves #114966

- #114964
- #114966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124068
Approved by: https://github.com/jansel
2024-04-17 15:57:14 +00:00
36f6928a37 Revert "[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556)"
This reverts commit 41613a0803f7cde7956f039bc80f94253b0843f9.

Reverted https://github.com/pytorch/pytorch/pull/120556 on behalf of https://github.com/aaronenyeshi due to Breaks GPU Chrome trace UI ([comment](https://github.com/pytorch/pytorch/pull/120556#issuecomment-2061578951))
2024-04-17 15:38:14 +00:00
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
5a735ece6b Remove @abock from ONNX approvers/codeowners (#124259)
As he is no longer interested in the project
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124259
Approved by: https://github.com/kit1980, https://github.com/BowenBao
2024-04-17 14:13:53 +00:00
b880a71010 [BE] Add missing std:: prefix to Unique.mm (#124232)
Follow up after https://github.com/pytorch/pytorch/pull/124117 fixes following warning
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Unique.mm:282:26: warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
  return std::make_tuple(get<0>(out).to("mps"), get<1>(out).to("mps"), get<2>(out).to("mps"));
                         ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-04-17 14:12:29 +00:00
5f378e1853 [Inductor cutlass backend] Fix flaky test ( CUDA IMA ) (#124106)
A unit test within test_cutlass_backend.py can fail with CUDA illegal memory accesses due to the fact that some CUTLASS Kernels contain bugs.

By using autotuning in subprocesses, this CUDA illegal memory access simply
leads to the buggy Cutlass Kernels being filtered out, instead of causing it
to bring down the entire process.

Test Plan:
This is a change to a unit test. It's recommended to use autotune_in_subproc when using the Cutlass backend anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124106
Approved by: https://github.com/eellison
2024-04-17 13:19:13 +00:00
47dbfecd37 Rename impl_abstract to register_fake, part 1/2 (#123937)
This PR:
- adds a new torch.library.register_fake and deprecates
  torch.library.impl_abstract. The motivation is that we have a lot of
  confusion around the naming so we are going to align the naming with
  the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
  `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
  here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
  pystub)

I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.

DEPRECATION NOTE: torch.library.impl_abstract was renamed to to
torch.library.register_fake. Please use register_fake. We'll delete
impl_abstract in a future version of PyTorch.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
2024-04-17 12:46:01 +00:00
6efcb6c718 Fix wrong ufmt exclusions in .lintrunner.toml (#124135)
Part of: #123062

In this pull request(#123809), there were some exclusions that should have been removed, but weren't.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124135
Approved by: https://github.com/ezyang
2024-04-17 12:22:50 +00:00
2dc15b6849 Revert "[sparse] Add fast semi-structured spasification kernels (#122350)"
This reverts commit 14b2273b0c58b4000e10b2e441341eeafb7dd2f6.

Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2061070350))
2024-04-17 11:47:02 +00:00
3f89f565bb Revert "Re-land precompile triton templates (#124030)"
This reverts commit d68196e7ef5eb8f62064ef70c75032f4d8b4a4fa.

Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))
2024-04-17 11:31:33 +00:00
77ad630f5d Revert "Dont precompile already seen keys, limit epilogue choices (#122642)"
This reverts commit 050051f412e50d98d506adf0d05aa6e4ceab54bd.

Reverted https://github.com/pytorch/pytorch/pull/122642 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))
2024-04-17 11:31:32 +00:00
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title stated.

But, maybe we should pay attention to this https://github.com/pytorch/pytorch/pull/33206, which removed the half support for cpu about 4 years ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
c4878abab0 Fix Setup Linux for ARC (#124171)
We can't get information about `ami-id`, `instance-id`, `instance-type` for the ARC runners:

```
2024-04-16T11:10:17.0098276Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0110775Z ami-id:
2024-04-16T11:10:17.0159131Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0167378Z instance-id:
2024-04-16T11:10:17.0219464Z curl: (22) The requested URL returned error: 401
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124171
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/zxiiro
2024-04-17 09:25:02 +00:00
d0211e207c inductor cpp wrapper: add GIL release back (#123897)
Fixes https://github.com/pytorch/pytorch/issues/123517.
This PR adds the GIL release (originally added in https://github.com/pytorch/pytorch/pull/111888) back following the suggestion here: https://github.com/pytorch/pytorch/pull/123897#discussion_r1562509705.
We added a default constructor and an assignment operator for the `RAIIPyObject` class
 (https://github.com/pytorch/pytorch/pull/123897#discussion_r1566262575) in order to declare the `custom_op_wrapper` outside of the GIL acquisition scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123897
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-04-17 07:18:14 +00:00
e3effa5855 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-17 06:46:02 +00:00
ed22dde877 Pointer to the nonzero limit ticket (#124244)
For the nonzero impl limits we are still asking at runtime to fill a new ticket  but we had already more then one.
So I am pointing to the current open ticket.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124244
Approved by: https://github.com/ezyang
2024-04-17 06:15:36 +00:00
dd3cea3291 Fix derived dim bugs in ep.run_decomp (#123326)
Differential Revision: [D55730289](https://our.internmc.facebook.com/intern/diff/D55730289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123326
Approved by: https://github.com/avikchaudhuri
2024-04-17 04:00:55 +00:00
236b0d12fa Don't clamp slices generated from cat kernel (#124139)
Fixes https://github.com/pytorch/pytorch/issues/123793

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124139
Approved by: https://github.com/Microve, https://github.com/peterbell10, https://github.com/Skylion007
2024-04-17 03:13:10 +00:00
050051f412 Dont precompile already seen keys, limit epilogue choices (#122642)
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
2024-04-17 03:08:59 +00:00
51cc808ac7 [dynamo][cpp-guards] Missing decref on early returns in DictSubclassGuardManager (#124230)
I am sad that I missed this earlier. Good thing is that CI caught it. Will be more careful next time.

This was the reason https://github.com/pytorch/pytorch/pull/123547 is reverted - https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058350245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124230
Approved by: https://github.com/mlazos
2024-04-17 02:49:07 +00:00
d68196e7ef Re-land precompile triton templates (#124030)
Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030
Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu
2024-04-17 02:30:46 +00:00
32ca18ea3b Handle the case when one of the output of forward pass is None (#123988)
Summary: When applying FSDP-2 to FM-FB benchmark with FullModel model, we ran into an error that one of the output tensors of a forward pass is None. I double checked that the same output tensor is also None in FSDP-1. So, we just need to handle the None properly here.

Test Plan:
See that in the internal diff.

Differential Revision: D56087956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123988
Approved by: https://github.com/awgu
2024-04-17 02:18:32 +00:00
6e4c4e93b6 [Inductor] add contiguous layout optm for bmm input (#122599)
Fixes #117743.

Add contiguous layout optimization for `bmm` input, to avoid additional copies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122599
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-04-17 02:12:20 +00:00
1fd9e320ea Remove unnecessary FileLock in Fx Graph Cache (#124212)
Writing to file happens via `write_atomic`, there's no need to take a global lock on the file system. This is likely creating unnecessary waits.

Differential Revision: [D56208628](https://our.internmc.facebook.com/intern/diff/D56208628/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124212
Approved by: https://github.com/masnesral, https://github.com/eellison
2024-04-17 01:02:41 +00:00
f56c4572a6 Fix typos in docs (#124218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124218
Approved by: https://github.com/albanD
2024-04-17 00:46:08 +00:00
bf45ac8c98 [FSDP2] Added explicit unshard(async_op) API (#120952)
This PR adds an `unshard(async_op: bool = False)` API to manually unshard the parameters via all-gather. This can be used for reordering the all-gather with other collectives (e.g. all-to-all).

This currently requires the user to set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` to avoid `recordStream` from `ProcessGroupNCCL` and get expected memory behaviors.

Differential Revision: [D56148725](https://our.internmc.facebook.com/intern/diff/D56148725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120952
Approved by: https://github.com/wanchaol
2024-04-17 00:39:34 +00:00
0abd3f60fd [CI] Reduce CI_SERIAL_LIST list (#124085)
Add serial marker for individual tests so the test file can be removed from the ci serial list
Run serial marked tests first in serial
Run all other tests afterwards in parallel

Slowly reduce list and mark individual tests as serial instead

Hope # of serial tests is small so sharding evenness doesn't get too messed up

Hopefully can do 3 procs for sm86 and cpu?

serial no longer looks like a real word to me

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124085
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:23:47 +00:00
946b50c788 [ez][TD] Increase logging (#124082)
increase logging during td
generate an artifact that says which tests got excluded
fix minor bug where filter test configs couldnt get commit messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124082
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-04-17 00:18:28 +00:00
e7cf6f81ea [sym_shapes][perf] Skip assert in check_is_size (#124209)
Differential Revision: [D56207943](https://our.internmc.facebook.com/intern/diff/D56207943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124209
Approved by: https://github.com/ezyang
2024-04-17 00:10:06 +00:00
cebf65126c FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: #124105
2024-04-16 23:28:42 +00:00
4322a0e782 Skip workspace permission change for ROCm CI (#123816)
PR https://github.com/pytorch/pytorch/pull/122922 added chown steps to test.sh and used the trap mechanism to ensure that, even if the test scripts fails and exits with a non-zero code, it will call the cleanup_workspace function on EXIT.

However, this doesn't work as intended when the CI job gets cancelled for eg. if a PR pushes new commits and the older commit CI job gets cancelled. The trap function doesn't get called as the test script immediately aborts.

Any subsequent jobs scheduled on the same runner then fail in the 'Checkout PyTorch' step when they try to delete the workspace.

This has been resulting in a slew of CI failures on the HUD.

Example of this situation playing out on one of the ROCm runners:
Cancelled job: https://github.com/pytorch/pytorch/actions/runs/8563212279/job/23469711035

![image](https://github.com/pytorch/pytorch/assets/37884920/7192e4fe-8cff-4256-abc8-9f874a3918ff)

Subsequent failed job: https://github.com/pytorch/pytorch/actions/runs/8564517036/job/23472675041

![image](https://github.com/pytorch/pytorch/assets/37884920/24b0af66-cfe9-431f-851a-24a1ccc18e84)

This PR skips the logic introduced by PR 122922 for ROCm CI.

Alternative to https://github.com/pytorch/pytorch/pull/123468 and https://github.com/pytorch/pytorch/pull/123588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123816
Approved by: https://github.com/pruthvistony, https://github.com/zxiiro, https://github.com/kit1980
2024-04-16 23:26:34 +00:00
3eea300680 [quant] Do not decompose choose_qparams_per_token_asymmetric (#124178)
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.

Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward

Reviewers: jerryzh168, digantdesai, zou3519

Subscribers: jerryzh168, digantdesai, zou3519, supriyar

Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
2024-04-16 22:58:48 +00:00
3e90e93a78 [inductor] disable comprehensive padding in fbcode (#124191)
Comprehension padding cause small NE change and fail an internal test. Disable it for internal use case to mitigate.

Differential Revision: [D56197430](https://our.internmc.facebook.com/intern/diff/D56197430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124191
Approved by: https://github.com/jansel
2024-04-16 22:44:08 +00:00
b3f88317ec [dtensor][5/N] have table-wise sharding use LocalShardsWrapper on participating ranks only (#122853)
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is on non-participating ranks: for non-participating ranks, the local tensor is an empty torch.Tensor object. The reason of this design is to avoid complexity on supporting empty tensor case on `LocalShardsWrapper`.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
2024-04-16 22:27:30 +00:00
d419fcd19f [dtensor][4/N] have row-wise sharding always use LocalShardsWrapper (#122843)
**Summary**
Always wrap local tensor into a `LocalShardsWrapper`. This is for uniformity and it leads to easiness on adoption of DTensor as a wrapper for local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`

**Result**
```
Row-wise even sharding example in DTensor
         Col 0-15
-------  ----------
Row 0-1  cuda:0
Row 2-3  cuda:1
Row 4-5  cuda:2
Row 6-7  cuda:3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
2024-04-16 22:27:30 +00:00
1d7ac7baa0 [dtensor][3/N] add torchrec row-wise uneven sharding example (#121392)
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-uneven`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121392
Approved by: https://github.com/wanchaol
ghstack dependencies: #120265
2024-04-16 22:27:28 +00:00
9d3543df9a [dtensor][2/N] add torchrec table-wise sharding example (#120265)
**Summary**
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
2024-04-16 22:27:24 +00:00
9d88339b53 Revert "make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347)"
This reverts commit 63dcb5b0f2ef3578e81841fd8a2166e732c0ca99.

Reverted https://github.com/pytorch/pytorch/pull/123347 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123347#issuecomment-2059994989))
2024-04-16 22:08:24 +00:00
440e4353c7 [DCP] Remove overlapping loader in async case (#123942)
In the async case, the state dict is already on CPU, so maintaining this buffer makes no sense. Additionally, using the overlapping cpu loader introduces new cuda synchronize calls, leading to additional unnecessary overhead.

Differential Revision: [D56065250](https://our.internmc.facebook.com/intern/diff/D56065250/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123942
Approved by: https://github.com/fegin
ghstack dependencies: #123941
2024-04-16 21:19:31 +00:00
606c4f1367 [PT] [ST] fix test_sharded_tensor (#124103)
Summary:
https://github.com/pytorch/pytorch/pull/123230 formalizes the rank validation to support sub groups.

It broke a few UTs, some of which got fixed in https://github.com/pytorch/pytorch/pull/123778

This is to fix the remaining one reported by DanilBaibak

Test Plan: CI

Differential Revision: D56155076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124103
Approved by: https://github.com/fegin
2024-04-16 21:18:22 +00:00
46a25cc0db [DCP] Adds support for non-primatives in async_save by deep copying during cpu offloading (#123941)
Adds support for non-primatives in async_save by deep copying during cpu offloading.

If users are not type checking, the expectation in async is likely that the object is copied

Differential Revision: [D56065237](https://our.internmc.facebook.com/intern/diff/D56065237/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123941
Approved by: https://github.com/fegin
2024-04-16 20:49:25 +00:00
14b2273b0c [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-16 20:31:52 +00:00
383d2d1f6c Add testing and fix issues for weights_only load for LRScheduler (#123775)
Fixes https://github.com/pytorch/pytorch/issues/98921

There were two issues detected:
- `MultiStepLR`: issue is described in https://github.com/pytorch/pytorch/issues/98921, this is resolved by allowlisting `collections.Counter`
- `OneCycleLR`: `state_dict['anneal_func']` is either `<function OneCycleLR._annealing_cos at 0x7f364186f5b0>` or
`<function OneCycleLR._annealing_linear at 0x7f39aa483640>` depending on the `anneal_func` kwarg.
   This leads to `WeightsUnpickler error: Unsupported class __builtin__.getattr` from the `weights_only` Unpickler.

  Fixed the above in a BC-compatible manner by adding `OneCyclicLR._anneal_func_type` as a string attribute and removing `OneCyclicLR.anneal_func`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123775
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-16 20:29:27 +00:00
1f89bf4188 Revert "[reland] _foreach_copy with different src/dst dtypes (#123844)"
This reverts commit ff1e3ff5a503a520c1a310c8e72a383657f9a4bc.

Reverted https://github.com/pytorch/pytorch/pull/123844 on behalf of https://github.com/malfet due to Perhaps it enabled it for different dtype, but broke for the same ([comment](https://github.com/pytorch/pytorch/pull/123844#issuecomment-2059861767))
2024-04-16 20:23:14 +00:00
42e22bb444 [nccl-pg] Pass pg name and desc to NCCL communicator (#124149)
Summary:
Pass Process Group Name and Desc to NCCL communicator in order to access pg information in NCCL layer.
The information is passed as commDesc string(i.e. "<pg_desc>:<pg_name>")
Function only valid when NCCL_COMM_DESCRIPTION is defined.

Differential Revision: D55703310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149
Approved by: https://github.com/shuqiangzhang
2024-04-16 20:08:07 +00:00
72271fb07e Add NEON ISA support on aarch64 (#123584)
Fixes #104729

This improves the compiled mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function. Thereby also improves BERT inference by around 8%.

Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner.

Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs) #Warmup
    print("Warmup over")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Check if the compiled model inference and the eager model inference are similar using torch.allclose
print(torch.allclose(compiled_model(inputs), model(inputs)))
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-04-16 18:49:52 +00:00
67bd43b510 [compiled autograd][dynamo] use aliases for stack restore when partial graphs steal inputs (#124127)
same idea as https://github.com/pytorch/pytorch/pull/123359, but for when we restore stack variables after calling a partial graph:

Illustrated by the test case:

before:
```python
def function(inputs):
    graph_out_0 = __compiled_fn_2(inputs)
    getitem_1 = graph_out_0[0]
    add = inputs[1]  <---- error inputs is already cleared
    del graph_out_0
    add_1 = add + getitem_1
    add = None
    getitem_1 = None
    cpu = add_1.cpu()
    add_1 = None
    return (cpu,)
```
after:
```python
def function(inputs):
    inputs_ref_0 = inputs[1]
    graph_out_1 = __compiled_fn_2(inputs)
    getitem_1 = graph_out_1[0]
    add = inputs_ref_0
    del graph_out_1
    add_1 = add + getitem_1
    add = None
    getitem_1 = None
    cpu = add_1.cpu()
    add_1 = None
    return (cpu,)
```

Co-authored-by: Jason Ansel <jansel@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124127
Approved by: https://github.com/jansel
2024-04-16 17:01:34 +00:00
d838cc8f66 [DCP] Returns a copy of sd in copy sd (#123567)
I found that returning the copy is actually useful in situations where you might do something like:

```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```

and would like `cache` not to change structure from `ret.update(some_other_values)`.  Open to some notes here, not returning a copy might force the user to do some additional copies for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
2024-04-16 15:29:32 +00:00
0f6ce45bcb [Inductor] handle AMD special launch options (#124146)
Summary: `matrix_instr_nonkdim` and `waves_per_eu` are AMD specific launch configs that can't be treated as fn input args

Test Plan:
HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true -c fbcode.rocm_arch=mi300 //hammer/modules/sequential/encoders/tests:hstu_bench -- --torch-compile=True

the E2E works well on the magic model

Differential Revision: D56165438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124146
Approved by: https://github.com/aakhundov
2024-04-16 11:07:17 +00:00
4dc160864b [dynamo, 3.12] enable dynamo-wrapped tests in CI (#123307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123307
Approved by: https://github.com/jansel, https://github.com/malfet
ghstack dependencies: #124095, #124100, #124124
2024-04-16 08:44:43 +00:00
962096bce6 [dynamo, 3.12] skip some failing profiler dynamo-wrapped tests (#124124)
The dynamo wrapped tests and normal tests give the same results locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124124
Approved by: https://github.com/jansel, https://github.com/aaronenyeshi
ghstack dependencies: #124095, #124100
2024-04-16 08:44:43 +00:00
5e17f62d10 [dynamo, 3.12] move functorch/test_aotdispatch.py::TestAOTAutograd::test_view_detach from dynamo xfail to skip (#124100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124100
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #124095
2024-04-16 08:44:43 +00:00
9309580d69 [dynamo, 3.12] handle possibility of NULL local variables during graph breaks (#124095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124095
Approved by: https://github.com/jansel
2024-04-16 08:44:43 +00:00
2b3594f90e [dynamo] fix call_finally issue in Python 3.8 (#124122)
Fix https://github.com/pytorch/pytorch/issues/97811 again...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124122
Approved by: https://github.com/jansel
2024-04-16 08:36:20 +00:00
298eb69c91 [EZ] Make weight_int4pack_mm compilable for half input dtype (#124136)
To enable efficient int4 quantization on ARM

Followup after https://github.com/pytorch/pytorch/pull/124022
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124136
Approved by: https://github.com/mikekgfb
2024-04-16 08:10:59 +00:00
bb0c768c5b [dynamo][refactor] Move LazyGraphModule handling (#124113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124113
Approved by: https://github.com/jansel
ghstack dependencies: #124078
2024-04-16 06:39:45 +00:00
530bf391cc Revert "[dynamo] Turn on CPP guard manager (#123547)"
This reverts commit 3e98bdd66d2b051a918e58d5f7bb80b366677bf8.

Reverted https://github.com/pytorch/pytorch/pull/123547 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058337419))
2024-04-16 06:38:15 +00:00
52be63eb2c Revert "Enable UFMT on all of test/distributed (#123539)"
This reverts commit 89ac37fe919997e844f0baa6e28965d0d52b0682.

Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))
2024-04-16 06:33:21 +00:00
2e48f7b044 [pytree] add tree_iter function (#123913)
- Add a new `tree_iter` function.
- Bump `optree` version to `0.11.0` for C++ version of `tree_iter`.

This PR is split from #120300.

- #120300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123913
Approved by: https://github.com/zou3519
2024-04-16 06:02:08 +00:00
0eab740db3 [Docs][Distributed] Add migration notes for --local-rank option style change for torchrun in PyTorch 2.0 (#109480)
Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480
Approved by: https://github.com/ezyang
2024-04-16 05:51:57 +00:00
7530c5a85d [DOC] Fix example and typo (#123959)
Fixes #123554 and fixes #123053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123959
Approved by: https://github.com/mikaylagawarecki
2024-04-16 05:38:24 +00:00
cyy
5bef127c2e [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-16 04:39:20 +00:00
83ef3bb128 Fix AVX512 int4pack_mm_kernel crash if weighs are unaligned (#124128)
By replacing `_mm256_load_si256` with `_mm256_loadu_si256`, as there are no guarantees that tensor should be aligned

Fixes crash reported in https://github.com/pytorch/pytorch/issues/124034 though I'm unsure about perf implications if tensor are properly aligned

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124128
Approved by: https://github.com/mikekgfb
2024-04-16 04:35:25 +00:00
df5829d0ba [inductor] let rand_strided support fp8 (#124120)
I'm working on https://fb.workplace.com/groups/1075192433118967/posts/1411161629522044/ (this is a meta internal link about a inefficient inner/persistent reduction kernel generated by inductor). I found the generated benchmark code for a kernel ( https://gist.github.com/shunting314/13a0105f72a1c54d9c220370c7fd3845 ) can not be run since rand_strided failed to generate tensors for fp8. Errors are like

```
RuntimeError: "normal_kernel_cpu" not implemented for 'Float8_e4m3fn'
```
for CPU
or
```
RuntimeError: "normal_kernel_cuda" not implemented for 'Float8_e4m3fn'
```
for GPU

This PR work around that problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124120
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-04-16 04:15:56 +00:00
89ac37fe91 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-16 03:23:56 +00:00
e4efa311f1 Refactor test_tensor_set_data to be parametrized (#124105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124105
Approved by: https://github.com/albanD
2024-04-16 03:23:41 +00:00
791e5db705 Part 3: UFMT fix the rest files in torch/optim due to the pr-sanity-checks (#124055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124055
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053, #124054
2024-04-16 03:22:39 +00:00
ac74a6783b Part 2: UFMT fix 2 files in torch/optim due to the pr-sanity-checks (#124054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124054
Approved by: https://github.com/ezyang
ghstack dependencies: #124048, #124053
2024-04-16 03:20:21 +00:00
560efaa471 Part 1: UFMT partial files in torch/optim due to the pr-sanity-checks (#124053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124053
Approved by: https://github.com/ezyang
ghstack dependencies: #124048
2024-04-16 03:17:18 +00:00
f30704f5f3 add preparatory work for torch/optim/lr_scheduler.py (#124048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124048
Approved by: https://github.com/albanD
2024-04-16 03:17:18 +00:00
6babf00014 [inductor] Bypass FX graph cache when we have HigherOrderOperators (#123325)
Summary: The initial motivation was to avoid caching when we have triton higher order ops, but it's probably safer to avoid the cache for all higher order ops and allow/implement if/when we find it necessary.

Test Plan: Unit test cribbed from: https://docs-preview.pytorch.org/pytorch/tutorials/2783/recipes/torch_compile_user_defined_triton_kernel_tutorial.html?highlight=triton

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123325
Approved by: https://github.com/eellison
2024-04-16 02:51:49 +00:00
ff1e3ff5a5 [reland] _foreach_copy with different src/dst dtypes (#123844)
Attempt to reland https://github.com/pytorch/pytorch/pull/121717.
The change is the array bounds check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123844
Approved by: https://github.com/janeyx99
2024-04-16 02:20:58 +00:00
a4c8002ee0 MPS FFT implementation bug (#123274)
Current implementation drops the negative frequency components even when the user doesn't ask for the one-sided transform. The tests for the negative frequency components seem to have worked by accident due to internal implementation details but the issue becomes evident in MacOs 14.4.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123274
Approved by: https://github.com/malfet
2024-04-16 02:02:37 +00:00
eeb626b46a [BE] Do not use using namespace in mps headers (#124117)
- Remove `using namespace std` from `MPSDevice.h`
- Add `std::` prefix to 1st argument of `MPSProfiler::StartTrace`
- Do the same in front of `numeric_limits` template instantiation in `ReduceOps.mm`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124117
Approved by: https://github.com/malfet
2024-04-16 01:39:42 +00:00
1cf62e86a4 skip various unit tests for Jetson (#122531)
skip multiprocessing, cuda expandable segments, mem eff and flash attention tests on Jetson due to hanging / sigkill issues from nvidia internal testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122531
Approved by: https://github.com/eqy, https://github.com/malfet
2024-04-16 01:26:26 +00:00
aaad0554b4 [Inductor] Fix endless recursion in codecache.DLLWrapper.__getattr__ (#123931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123931
Approved by: https://github.com/peterbell10
2024-04-16 00:52:21 +00:00
cyy
c2596fd3e0 [Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
9079c76689 Fix Asynchronous PyTorch Profiler Trace (#124080)
Summary: With the merge of D55925068, we have introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOPs to be enumerated but not have an end time since they were running as the recording ended. By default these events have an end time set to INT_MIN. When finding the duration() for such events using end-start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing the INT_MIN by 1000 because we were trying to convert uS to nS. This change introduces a patch for TorchOps and a future PR will be added to create a more universal guard in kineto.

Test Plan:
Trace recorded using resnet test.

Trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces

Differential Revision: D56144914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080
Approved by: https://github.com/aaronenyeshi
2024-04-16 00:24:32 +00:00
1885c3972d [C10D] Add dist.get_node_local_rank helper (#123992)
Fixes #122816

Summarizing the pros/cons of the request and motivation from #122816

- (+) it's really common for users to do 'os.getenv["LOCAL_RANK"]' so we
  should provide a helper
- (-) we can't really control if/how local rank information is made
  available, but it is handled automatically if torchrun is used.

We can assume local rank is correctly passed if it is passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123992
Approved by: https://github.com/shuqiangzhang, https://github.com/zdevito, https://github.com/XilunWu
2024-04-16 00:09:46 +00:00
2b54b00e30 Update some more APIs to have positional-only args (#124063)
Not BC-breaking since we haven't released these yet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124063
Approved by: https://github.com/albanD
ghstack dependencies: #123615, #124062
2024-04-15 23:32:47 +00:00
3c25b18d76 Excise old custom ops prototype from custom_op_db (#124062)
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124062
Approved by: https://github.com/albanD
ghstack dependencies: #123615
2024-04-15 23:32:47 +00:00
a03711d24d [custom_ops] Support TensorList inputs/outputs (#123615)
We add a `supports_tensorlist` decorator that gives an autograd.Function
the ability to handle TensorLists.

Test Plan:
- custom_op_db tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123615
Approved by: https://github.com/albanD
2024-04-15 23:32:43 +00:00
5a15cbfa44 Fix typo in TorchScript annotate docstring (#123719)
It's already in the docstring for torch.jit.Attribute to use Attribute in a __init__ method of a Module. However, this was wrong in the `annotate` docstring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123719
Approved by: https://github.com/mikaylagawarecki
2024-04-15 22:52:20 +00:00
-
70ad64e8a6 update docs for separate context and forward functions (#121955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121955
Approved by: https://github.com/soulitzer
2024-04-15 22:31:12 +00:00
9fa922c2ed [profiler] Log process group name instead of pg uid (#124035)
Summary:
As part of the work of unifying process group identifier, log <group_name, group_desc>,  instead of pg uid in profiler.
- group_name remains as the unique identifier, e.g. “0”, "1"
- group_desc will be the user specified name, e.g. "fsdp".

Reviewed By: aaronenyeshi, kwen2501

Differential Revision: D55610682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
bd222473fc [EZ][BE] Fix unknown pragma warning (#124086)
By using `C10_DIAGNOSTIC_` macros instead of `#pragma clang diagnostic` that puts appropriate compiler supported pragmas. Fixes following warning during the bazel build
```
INFO: From Compiling aten/src/ATen/native/TensorFactories.cpp:
aten/src/ATen/native/TensorFactories.cpp:372: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas]
  372 | #pragma clang diagnostic push
      |
aten/src/ATen/native/TensorFactories.cpp:373: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas]
  373 | #pragma clang diagnostic ignored "-Wmissing-prototypes"
      |
aten/src/ATen/native/TensorFactories.cpp:375: warning: ignoring #pragma clang diagnostic [-Wunknown-pragmas]
  375 | #pragma clang diagnostic pop
      |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124086
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007
2024-04-15 21:44:31 +00:00
9aba918bd8 Support Accelerator OOM Error (#121200) (#121702)
Fixes #121200
This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backend. For python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backend then can use this error for memory errors. Maybe more error types in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-04-15 21:41:46 +00:00
495a4d4a42 [FSDP2] Added mesh arg to fsdp_pre_all_gather (#123953)
This PR adds a `mesh: DeviceMesh` argument to `fsdp_pre_all_gather()` so that the extension can know over which mesh the all-gather is happening. This can be useful in recovering the post-all-gather tensor size in the `fsdp_post_all_gather()` (e.g. for `NF4Tensor`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123953
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #119302, #122908
2024-04-15 21:35:51 +00:00
d1a0821e7e [FSDP2] Added pre/post-all-gather extensions (subclass) (#122908)
**Overview**
This PR adds pre/post-all-gather extensions to FSDP2.
- The pre/post-all-gather extensions are specified at the tensor-level on the `sharded_param._local_tensor` (i.e. the tensor wrapped by the sharded `DTensor`). If the user has a tensor-subclass parameter on the module passed to FSDP that preserves the subclass through the sharding ops (e.g. `new_zeros`, `chunk`, etc.), then the `sharded_param._local_tensor` will naturally be of that subclass.
- The pre-all-gather function has signature:
  ```
  def fsdp_pre_all_gather(self) -> Tuple[Tuple[torch.Tensor, ...], Any]
  ```
    - The first return value is a `Tuple[torch.Tensor, ...]` of the all-gather inputs. It is a tuple since a subclass could contribute >1 inner tensors.
    - The second return value is any optional metadata needed to pass through to the post-all-gather.
- The post all-gather function has signature:
  ```
  def fsdp_post_all_gather(
      self,
      all_gather_outputs: Tuple[torch.Tensor, ...],
      metadata: Any,
      param_dtype: torch.dtype,
      *,
      out: Optional[torch.Tensor] = None,
  ) -> Union[Tuple[torch.Tensor, Tuple[torch.Tensor, ...]], None]:
  ```
    - The `all_gather_outputs` are exactly the all-gathered versions of the `fsdp_pre_all_gather` 1st return value (representing the all-gather inputs). We make sure to unflatten these back to ND for the user.
    - The `metadata` is the `fsdp_pre_all_gather` 2nd return value, untouched.
    - The `param_dtype` is the parameter dtype based on the passed-in `MixedPrecisionPolicy`. Namely, if no policy is passed in, then `param_dtype` is the original dtype, and otherwise, it is the `MixedPrecisionPolicy.param_dtype`.
    - If `out` is not specified, then the return value has type `Tuple[torch.Tensor, Tuple[torch.Tensor, ...]]`. The first tuple item is the unsharded parameter (e.g. re-wrapping into some subclass). The second tuple item is a tuple of unsharded inner tensors that FSDP should free during reshard. These should be derived from the all-gather outputs.
    - The `out` argument is required due to FSDP's `resize_` usage. We require an in-place variant for the backward all-gather. Here, `out` will be exactly the object returned as the first tuple item in the out-of-place variant mentioned before. The unsharded inner tensors will be allocated before calling `fsdp_post_all_gather`. When `out` is specified, the `fsdp_post_all_gather` should return `None`. If the post-all-gather does not do any out-of-place ops, then the `out` variant can just be a no-op since the unsharded inner tensors will be the same as the all-gather outputs, which FSDP directly writes to after all-gather. (E.g., this is the case for both float8 and `NF4Tensor`.)
- We check for `fsdp_pre_all_gather` and `fsdp_post_all_gather` directly via `hasattr` to accommodate monkey patching so that we do not strictly require the user to use a tensor subclass. The monkey patch must happen after the local tensors have been finalized (after applying FSDP and after any meta-device init).
- For now, we require that all gradients in one FSDP parameter group share the same dtype. This is fine for float8 and `NF4Tensor` use cases. If this requirement is too strict, then in the future we can issue 1 reduce-scatter per dtype per group.

**Design Notes**
- We assume that the `sharded_param._local_tensor` is padded on dim-0.
    - This assumption should not block immediate use cases, and when we pad the `DTensor._local_tensor` by default, this assumption will always be true.
    - This assumption allows us to call `sharded_param._local_tensor.fsdp_pre_all_gather()`; i.e. it tells us from which tensor object to invoke `fsdp_pre_all_gather()`.
    - Suppose we want to compose with CPU offloading. Then, CPU offloading's H2D copy should run first, i.e. `sharded_param._local_tensor.to("cuda").fsdp_pre_all_gather()`, where `_local_tensor.to("cuda")` should return an instance of the subclass so that it still defines `fsdp_pre_all_gather()`. Note that in this case, the subclass instance on GPU is a temporary, which means caching values on it would not be possible. One possibility would be to have `.to("cuda")` move any cached values too.
- `fsdp_post_all_gather` can either return an unsharded parameter that aliases with the all-gather output or does not alias, but there is no way to know a priori.
    - If the unsharded parameter aliases with the all-gather output, then we should _not_ free the all-gather output in `unshard`.
    - If the unsharded parameter does not alias with the all-gather output, then we prefer to free the all-gather output in `unshard` to avoid holding the unneeded temporary.
    - One approach is for eager-mode to check for this alias (by comparing data pointers). However, this might be adversarial to full-graph compilation. The compromise for simplicity can be to always free the all-gather output in `reshard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122908
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119302
2024-04-15 21:35:51 +00:00
ea52918e81 [FSDP2] Generalized all-gather outputs to >1 per parameter (#119302)
This PR is part of the FSDP extensions work. For subclasses such as for QLoRA's `NF4Tensor` (using block-wise quantization) that have multiple inner tensors per parameter, we must generalize to allow each parameter to contribute >1 all-gather inputs and hence have >1 all-gather outputs.

This PR does this generalization by converting `FSDPParam.all_gather_input: torch.Tensor` to `FSDPParam.all_gather_inputs: List[torch.Tensor]`. Unfortunately, since we need to preserve the mapping from all-gather inputs/outputs to their source parameter, we have to introduce `List[List]` instead of simply `List` in several places. Furthermore, we still require the flattened 1D `List` for `torch.split` calls, introducing some redundancy between data structures. Nonetheless, I do not see a way to avoid this if we want the generalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119302
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-04-15 21:35:46 +00:00
975f77784f Fix CUDA out of memory error message formatting (#123984)
We need a string instead of an integer here. With device 0, the
string was getting NULL terminated leading to a truncated
error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123984
Approved by: https://github.com/eqy, https://github.com/peterbell10
2024-04-15 21:00:17 +00:00
95a090fb56 [CI] Update bazel deps (#124076)
- Update `WORKSPACE` to actually use Python-3.10 as job name claims it is
- Get rid of unneeded `future` and `six` dependencies (Removed long time ago)
- Update `requests`, `typing-extensions` and `setuptools` to the latest releases
- Mark `tools/build/bazel/requirements.txt` as a generated file

This also updates idna to 3.7 that contains a fix for [CVE-2024-3651](https://github.com/advisories/GHSA-jjg7-2v4v-x38h), though as we are no shipping a binary with it, it does not expose CI system to any actual risks

TODOs:
 - Add periodic job that runs `pip compile` to update those to the latest version
 - Unify varios requirements .txt (i.e. bazel requirements and requirements-ci should be one and the same)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124076
Approved by: https://github.com/seemethere, https://github.com/DanilBaibak
2024-04-15 20:39:50 +00:00
601112fdb4 [dynamo][log] Print missing skipped frame info on debug (#124078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124078
Approved by: https://github.com/yanboliang
2024-04-15 20:33:17 +00:00
e5b404b809 [inductor] Fix fresh_inductor_cache() (#122661)
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.

Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
2024-04-15 20:28:54 +00:00
99059affb9 Use packed metadata from triton to reduce launch latency (#123842)
https://github.com/openai/triton/pull/3633 converts some kernel launch metadata from a namedtuple to a regular tuple, which is faster to parse.  Using it here shaves off a microsecond or so from the apparently extremely-sensitive launch path.

Fixes #123597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123842
Approved by: https://github.com/jansel, https://github.com/shunting314
ghstack dependencies: #123841
2024-04-15 19:43:06 +00:00
6c9f5064ea Avoid retrieving launch metadata if launch_enter_hook is not installed (#123841)
Fixes #123597

There's a sizable comment in the PR about why this is needed, but essentially the launch path is really really perf sensitive (running `launch` is ~30 microseconds, and according to the linked issue, regressing it to 33us is worth 6% overall on torchbench).  The `bin.launch_metadata` call doesn't look super expensive, but microseconds matter, and this is only useful when we have a launch hook installed (which seems pretty rare?).  This change is worth about 2us, and when combined with the other diff in the stack seems to completely eliminate the torchbench regression.

Differential Revision: [D56046347](https://our.internmc.facebook.com/intern/diff/D56046347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123841
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-04-15 19:43:06 +00:00
90d1720861 [export] Restore original placeholder names (part 3: constant input de/serialization) (#123590)
Summary:
note: breaking the original diff D55225818 into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.

Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...

This PR supports constant argument placeholder (e.g. forward(self, x, y=1)) names and de/serialization, by adding a name field for ConstantArguments in the graph signature, and ConstantInputSpec in the input specs for serialization.

Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py

Differential Revision: D55506949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123590
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-04-15 19:09:41 +00:00
fb6f6270d6 [inductor] comprehensive padding (#120758)
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently.

By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
2024-04-15 19:05:51 +00:00
221c397e2e Use NEON to speedup int8pack_mm on aarch64 (#124023)
Just vectorizing innter loop as follows:
```cpp
float32x4_t c_val = vdupq_n_f32(0.0);
for (int k = 0; k < K; k += 8) {
  float16x8_t a_val = vld1q_f16(reinterpret_cast<const float16_t *>(A) + m * lda + k);
  int16x8_t b_val = vmovl_s8(vld1_s8(B + n * ldb + k));
  auto a_val_low = vcvt_f32_f16(vget_low_f16(a_val));
  auto a_val_high = vcvt_f32_f16(vget_high_f16(a_val));
  auto b_val_low = vcvtq_f32_s32(vmovl_s16(vget_low_s16(b_val)));
  auto b_val_high = vcvtq_f32_s32(vmovl_s16(vget_high_s16(b_val)));
  c_val = vaddq_f32(c_val, vmulq_f32(a_val_low, b_val_low));
  c_val = vaddq_f32(c_val, vmulq_f32(a_val_high, b_val_high));
}
float scale_val = static_cast<float>(scales[n]);
C[m * ldc + n] = reduce(c_val) * scale_val;
```

Which bumps perf from 35 to 58 tokens per second (65% perf gain).
Unrolling both inner and outer loops bumps perf to 64 tokens per sec
(i.e. another 10% gain)

Before/after performance running stories110M on M2Pro
| eager (before) | eager (after) | compile(before) | compile (after) |
| ---- | --- | -- | -- |
| 35 | 64  | 56 | 132 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124023
Approved by: https://github.com/mikekgfb
ghstack dependencies: #124022
2024-04-15 18:57:59 +00:00
8ce29f1416 Enable UFMT on test/onnx_caffe2, test/optim, test/package and test/profiler (#123901)
Part of: #123062

Ran lintrunner on:

 - `test/onnx_caffe2`
- `test/optim`
- `test/package`
- `test/profiler`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123901
Approved by: https://github.com/ezyang
2024-04-15 17:46:59 +00:00
63dcb5b0f2 make sure dynamo doesn't inline DTensor __new__ or __torch_dispatch__ (#123347)
Fixes https://github.com/pytorch/pytorch/issues/122459, https://github.com/pytorch/torchtrain/issues/61

Even with the previous PR ("support DTensor/subclass constructors directly in the graph"), I still see some errors when running the repro above that start some logs showing that dynamo is inlining `__new__`.

I noticed that putting `@torch._dynamo.disable` on DTensor's `__new__` makes the entire repro pass.

Why does having dynamo try to inline `Subclass.__new__` run into problems? Morally, dynamo probably shouldn't be inlining __new__ ("creating a subclass" is a blackbox operation that AOTAutograd can trace through anyway). But concretely, we can end up with a node in the dynamo FX graph that has a "partially initialized tensor subclass" as its example value, because the subclass has been created but its fields have not been assigned to yet.

This breaks a bunch of invariants throughout dynamo: there are many places where if we have a tensor subclass node, we want to look at its inner tensors, to see if they are FakeTensors, what their FakeTensorMode is, and if they have dynamic shapes.

One option is to decide that "uninitialized subclass" is a first-class thing that anyone looking at the FX node examples values on the dynamo graph needs to handle, but this seems like a lot of work when in reality we don't need dynamo to trace the __new__ at all. Hence the `torch._dynamo.disable`.

I still wasn't very satisfied, since it was unclear to me **why** dynamo was inlining the `__new__` call, instead of interposing on the `DTensor()` constructor directly. After a long chat with @anijain2305, he explained that with code like this:
```
@torch._dynamo.disable(recursive=False)
def f(x):
    out = SubclassConstructor(x)
```

Dynamo will never get the chance to interpose on the subclass constructor. Instead, what will happen is:
(1) Dynamo hands back control to cpython to run `f()`, since we disabled that frame
(2) `SubclassConstructor(x)` is run in eager mode
(3) `SubclassConstructor(x)` eventually calls `SubclassConstructor__new__`
(4) this is a new frame, that cpython then allows dynamo to intercept and start compiling

So it looks like we are basically forced to handle the situation where dynamo might directly start compiling `Subclass.__new__`

All of the above does not explain the story for `__torch_dispatch__` though. Empirically, I have a repro in torchtrain where looking at the dynamo logs, we see dynamo try to inline `__torch_dispatch__`.
```
[rank0]:DEBUG: Skipping frame because no content in function call _prepare_output_fn                     /data/users/hirsheybar/b/pytorch/torch/distributed/tensor/parallel/style.py 318
[rank0]:DEBUG: torchdynamo start compiling __torch_dispatch__ /data/users/hirsheybar/b/pytorch/torch/distributed/_tensor/api.py:297, stack (elided 5 frames):
```

I haven't been able to create a smaller repro of the problem (even using `_dynamo.disable(recursive=False)`), although in theory, if there is a `torch.*` op that you were to inline (where one of the inputs is a subclass), the next frame would likely be `__torch_dispatch__`. Dynamo always treats `torch.*` operations as not-inlinable though, so in theory we shouldn't ever see dynamo inline `__torch_dispatch__`, but a `_dynamo.disable()` fixes the problem.

I asked Animesh if we can have dynamo automatically apply this behavior to subclasses instead of needing it to be added explicitly. He pointed out that for `disable(recursive=False)`, we can't really do this within dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123347
Approved by: https://github.com/zou3519
ghstack dependencies: #122502, #122751, #123348
2024-04-15 17:23:20 +00:00
9c4fc5fa34 [BE][Ez]: Fix minor potential perf regression from #123960 (#124013)
The `non_blocking` arg here is useless if the values are all eagerly consumed, so revert the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124013
Approved by: https://github.com/ezyang
2024-04-15 16:51:45 +00:00
fea1b99d89 Remove warning from LazyModuleMixin constructor. (#123968)
Remove warning from `LazyModuleMixin` about lazy modules being a new feature under heavy development. The last nontrivial change to the code happened more than three years ago.

Fixes #123928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123968
Approved by: https://github.com/mikaylagawarecki
2024-04-15 15:36:55 +00:00
af9a707233 Use uv in lintrunner init when it is available. (#124033)
Before, a no-op lintrunner init takes 12s.  After, it takes 1s;
a full order of magnitude improvement.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124033
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-04-15 14:47:21 +00:00
7cd7a7aa8e [xla hash update] update the pinned xla hash (#124042)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124042
Approved by: https://github.com/pytorchbot
2024-04-15 10:50:54 +00:00
e3ac61587a Enable UFMT on test/functorch (#123541)
Partially addresses #123062

Ran lintrunner on:

- `test/functorch`

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123541
Approved by: https://github.com/zou3519, https://github.com/ezyang
2024-04-15 06:21:52 +00:00
03a05e791a Don't add non-integer Triton kernel arg 1 to equal_to_1 (#123886)
Summary: Triton compiler adds constnat argument 1 to `equal_to_1` [only when it's an int](8c5e33c77e/python/triton/runtime/jit.py (L275)). Here we restrict Inductor's `equal_to_1` in the same way.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg
...
----------------------------------------------------------------------
Ran 1 test in 6.528s

OK

$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 2 tests in 10.142s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886
Approved by: https://github.com/oulgen
ghstack dependencies: #123703
2024-04-14 20:34:05 +00:00
19f50333e9 Improve assert message for unbacked symint not written out (#123965)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123965
Approved by: https://github.com/Skylion007
2024-04-14 20:03:43 +00:00
a096e99a5d Enable int8mm kernel for float16 (#124022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124022
Approved by: https://github.com/mikekgfb
2024-04-14 19:48:43 +00:00
9bb54c7f3c [pytree] enable functools.wraps in Python pytree with dynamo (#124012)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124012
Approved by: https://github.com/Skylion007
2024-04-14 09:25:05 +00:00
f5331aade5 Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-14 06:57:41 +00:00
635c238bad Enable UFMT on all of test/quantization/jit &pt2e (#124010)
Partially addresses #123062
Ran lintrunner on:
- test/quantization/jit
- test/quantization/pt2e

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

cc, please @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124010
Approved by: https://github.com/ezyang
2024-04-14 06:07:23 +00:00
0dfe72c63b [dynamo, 3.12] fix positions and offsets of added instructions when we clean (#123991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123991
Approved by: https://github.com/jansel
ghstack dependencies: #123978
2024-04-14 03:58:04 +00:00
88a7159493 [NT] Fix typo in declared strides variable (#123856)
Summary:
Looks like it's missing an s in the declaration so pyre is throwing an error

{F1484357040}

Test Plan: expect no pyre errors

Differential Revision: D56023743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123856
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-04-13 19:55:57 +00:00
f3fd280238 [dynamo] Relax strict_mode for autograd.Function forward inputs (#123910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123910
Approved by: https://github.com/oulgen
2024-04-13 19:41:59 +00:00
6440d1baa6 [dynamo, 3.12] fix the block stack... again (#123978)
Some changes to how we handle blocks in 3.11+:
- We only keep track of with blocks that are not enclosed in a try block
- We do not compile partial graphs if we are in a block that is not in a tracked with block - i.e. any block enclosed in some non-with try/except/etc. block

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123978
Approved by: https://github.com/jansel
2024-04-13 17:07:02 +00:00
da7db5d345 [BE] migrate import sorter configurations to pyproject.toml (#123846)
Migrate import sorter configurations to `pyproject.toml` and delete `.isort.cfg`. Also, set the line length to 88 (which is the default of `black`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123846
Approved by: https://github.com/Skylion007
2024-04-13 12:54:14 +00:00
cyy
b60af92c17 [Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312)
This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312
Approved by: https://github.com/Skylion007
2024-04-13 11:45:00 +00:00
97261be0a8 Revert "Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)"
This reverts commit b2a0b8c446234f0b35a66aff87501c4596ea5d51.

Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077))
2024-04-13 07:47:32 +00:00
6ac8fe46dd Enable UFMT on all of test/quantization/ao_migration &bc (#123994)
Partially addresses #123062
Ran lintrunner on:
- test/quantization/ao_migration
- test/quantization/bc

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123994
Approved by: https://github.com/ezyang
2024-04-13 06:36:10 +00:00
285c93d64d [inductor] Write generated files from parent process (#123409)
Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file.  Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire
2024-04-13 06:31:28 +00:00
5961e23e76 primitive attribute assignment (#123898)
This PR ensures that assignment of attributes of primitive type work without needing any code changes in non-strict mode. (In a previous PR we banned attribute assignments of tensor type unless such attributes are registered as buffers.)

While strict mode errors on (all) attribute assignments, non-strict doesn't care, so one might assume that this kind of attribute assignment should already work in non-strict. However, there's a problem: we run through the program once for metadata collection and then run through it again for tracing, so the values observed during tracing (and potentially burned into the graph) do not reflect what should have been observed had the metadata collection pass not run.

So the only thing this PR needs to do is restore values of assigned attributes of primitive type once the metadata collection pass has run. We do this by moving the attribute assignment detecting context manager from the overall `aot_export` call in `_trace.py` to the metadata collection pass in `aot_autograd.py`, and extending it. The rest of the PR moves some utils around.

Differential Revision: D56047952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123898
Approved by: https://github.com/angelayi
2024-04-13 05:27:52 +00:00
891736f115 Fix links rendering when surrounding code in Dynamo deepdive (#123427)
I thought the RST was rendering correctly, but here we are.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123427
Approved by: https://github.com/peterbell10
2024-04-13 04:55:15 +00:00
7e3f80f00f accelerate binary_cross_entropy_with_logits (#122789)
Following https://github.com/pytorch/pytorch/pull/115539

Same benchmark in #115539:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|before #115539 |2049|1736|
|after #115539    |1320|1049|
|this PR               |907  |801|

This PR is faster 24-31% than the version after #115539.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122789
Approved by: https://github.com/peterbell10
2024-04-13 04:18:47 +00:00
d39e6b3156 Cleanup: Remove redundant inference_patterns PatternMatcherPass (#121602)
## Summary
Removes a redundant `PatternMatcherPass` in Inductor post-grad passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121602
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-04-13 04:04:11 +00:00
eeea3b12aa Fix _LazyConvXdMixin.initialize_parameters and add related tests (#123756)
Fixes #123257

_LazyConvXdMixin.initialize_parameters did not handle positional args (other than input) and kwargs to be passed on to the corresponding non-lazy class' .forward() method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123756
Approved by: https://github.com/mikaylagawarecki
2024-04-13 03:58:37 +00:00
2216068559 Enable UFMT on test/test_ops* (#123935)
Part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123935
Approved by: https://github.com/ezyang
2024-04-13 03:31:56 +00:00
71b8363f40 [inductor] Remove unused local variable. (#120227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120227
Approved by: https://github.com/Skylion007
2024-04-13 03:19:13 +00:00
2a2e1d8e4f [functional collective] change the Python APIs to only use the native funcol ops (#123777)
## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
2024-04-13 03:08:36 +00:00
2da3e113ca [functional_collective] remove the logic that forces torch-xla to use legacy funcol (#123776)
After https://github.com/pytorch/xla/pull/6887, torch-xla now also uses
the all_reduce from native funcol. So we can remove this logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123776
Approved by: https://github.com/wanchaol
2024-04-13 03:08:36 +00:00
58afcd7b61 [dynamo][dict] Add UnspecializedNNModuleVariable to dict keys (#122812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122812
Approved by: https://github.com/jansel
ghstack dependencies: #122943, #123877, #123878
2024-04-13 02:07:35 +00:00
fefe6e2fea [dynamo][3.12] Stop backend detection on the first RETURN_VALUE (#123878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123878
Approved by: https://github.com/williamwen42
ghstack dependencies: #122943, #123877
2024-04-13 02:07:35 +00:00
f1654fd4b0 [PT2D][FSDP] skip FSDP hooks base on dynamo config (#123021)
unit test: `pytest test/distributed/_composable/fsdp/test_fully_shard_compile.py`

For FSDP, we turn on/off compiling hooks base on `torch._dynamo.config.skip_fsdp_hooks`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123021
Approved by: https://github.com/yf225, https://github.com/anijain2305
2024-04-13 01:47:25 +00:00
cyy
77a45883ce [Reland] [Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123821)
Reland of #122892 with problematic changes reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123821
Approved by: https://github.com/Skylion007
2024-04-13 00:57:03 +00:00
879f5b9a39 Pass triton kernel info to record function (#123871)
Summary: This DIFF is to pass triton kernel information, such as kernel python file, kernel type, grid, and stream, to record_function. With these information, Execution trace can capture triton kernel and replay it in PARAM.

Test Plan:
unit test
    buck2 test caffe2/test:profiler -- test_record_function_fast

Differential Revision: D56021651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123871
Approved by: https://github.com/sraikund16
2024-04-13 00:55:44 +00:00
7234f180f3 Add mtia to codeowner (#123975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123975
Approved by: https://github.com/egienvalue
2024-04-13 00:46:08 +00:00
1d6c5972c1 [BE]: Optimize min/max/sum comprehensions C419 (#123960)
Automatic fixes that replaces certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
961eb39348 AOT logging: log fw_metadata with each graph (#118646)
Log fw_metadata for each AOT graph. This is helpful for seeing information about subclass graph inputs/outputs/tangents, and lots of other stuff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118646
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
ghstack dependencies: #118645
2024-04-12 23:53:53 +00:00
585cd117e6 [nccl-pg] print broadcast ncclunique id duration (#123963)
Summary: Print NCCL PG broadcast nccl unique id duration for measurement.

Differential Revision: D56048059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123963
Approved by: https://github.com/wconstab
2024-04-12 23:33:11 +00:00
3e98bdd66d [dynamo] Turn on CPP guard manager (#123547)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123547
Approved by: https://github.com/jansel
2024-04-12 23:30:56 +00:00
d564fe7dca [sparse] add proper path for cloning sparse tensors (#123127)
The code does the right thing (rather than crashing). This is a small step towards https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123127
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2024-04-12 23:19:51 +00:00
3dde6a461f fix cpp path in torch/_C/_autograd.pyi (#123924)
The file `tools/autograd/init.cpp` does not exist, I think the right path is `torch/csrc/autograd/init.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123924
Approved by: https://github.com/Skylion007
2024-04-12 22:32:00 +00:00
380180c918 Fix typo (#123767)
Fixes a tiny typo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123767
Approved by: https://github.com/Skylion007
2024-04-12 22:26:08 +00:00
7b11fb4695 [Dynamo] fix opcode YIELD_FROM and SEND (#123912)
This PR is split from #120300.

- #120300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123912
Approved by: https://github.com/anijain2305
2024-04-12 21:57:47 +00:00
4b889d1247 stop TestMakeFx leaking to other tests (#123958)
Fixes #123916

Due to MultiThreadedTestCase we're leaking is_fx_tracing_flag to other tests which causes any dynamo based tests to fail. The test execution order is arbitrary which caused this to not be caught in development.

Test plan:

```sh
pytest --random-order test/distributed/test_functional_api.py -k 'TestMakeFx or test_all_to_all_single_compile_True'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123958
Approved by: https://github.com/yifuwang
2024-04-12 21:43:12 +00:00
c65aa5af6e [Pytorch] doc sync-stream-and-free-HBM counter in memory_stats (#123799)
Differential Revision: D56000503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123799
Approved by: https://github.com/malfet
2024-04-12 21:19:45 +00:00
5d1f9bd2bc Move the trace_rules.py docs up (#123873)
I always remember that the docs exist but cannot actually find it in the
file because it is on line 3000. Moving it to the top of the file for
visibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123873
Approved by: https://github.com/yanboliang
2024-04-12 20:18:38 +00:00
79deff689f Update compile doc to suggest Module.compile (#123951)
For users for whom fqn change is problematic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123951
Approved by: https://github.com/msaroufim
2024-04-12 20:13:21 +00:00
762e19606e [quant] Enable backward for choose_qparams_per_token_asymmetric (#123452)
Summary: When running the backward for this op, we get the error:
```
RuntimeError: derivative for aten::aminmax is not implemented
```
This commit replaces this call with separate amin and amax
calls instead, which do have implemented derivatives.

Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward

Reviewers: jerryzh168, digantdesai

Subscribers: jerryzh168, digantdesai, supriyar

Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452
Approved by: https://github.com/digantdesai, https://github.com/jerryzh168, https://github.com/zou3519
2024-04-12 20:05:56 +00:00
3346ec8263 [BE] Document what is tested in TestOptim (#123853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123853
Approved by: https://github.com/soulitzer
2024-04-12 19:59:29 +00:00
eqy
2f0fc04fa3 [CUDA][64-bit indexing] Bump large tensor threshold of test_cross_entropy_large_tensor to 70GiB (#123772)
`torch.cuda.max_memory_reserved()` here shows 68729962496 (about 65546 MiB).

CC @malfet @crcrpar

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123772
Approved by: https://github.com/mikaylagawarecki
2024-04-12 19:18:20 +00:00
8069469081 [dynamo] Support Tuple[int] args to autograd.Function (#123887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123887
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804, #123896
2024-04-12 19:03:13 +00:00
70b8c58f84 [dynamo] Emit warning to turn on capture_scalar_outputs (#123896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123896
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803, #123804
2024-04-12 19:03:13 +00:00
e3935783f7 [dynamo] Fix @property on user-defined nn.Module (#123804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123804
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790, #123803
2024-04-12 19:03:13 +00:00
6bac183dc2 [dynamo] Support numpy.iinfo/finfo (#123803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123803
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786, #123790
2024-04-12 19:03:13 +00:00
11e6f84ad8 [dynamo] Graph break on uninitialized nn.Module (#123790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123790
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705, #123786
2024-04-12 19:03:13 +00:00
6022600cc6 [inductor] Handle meta tensor ops in graph (#123786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123786
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705
2024-04-12 19:03:13 +00:00
6b0ba6bbd3 [dynamo] Improve constant-prop for regex/torch.__version__ (#123705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123705
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700
2024-04-12 19:03:13 +00:00
a625705290 Enable UFMT on all of test/nn (#123809)
Part of: #123062

Ran lintrunner on:

- `test/nn`

with command:

```bash
lintrunner -a --take UFMT --all-files
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123809
Approved by: https://github.com/mikaylagawarecki
2024-04-12 18:32:25 +00:00
04acdad829 [PT] [FSDP] [test] add barrier device ids (#123866)
Summary:
without this the `ProcessGroupNCCL` lib would try to infer the device id and emit a warning.
This doesn't change the behavior just makes it explicit.

> ProcessGroupNCCL.cpp:3720] [PG 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

Test Plan: CI

Differential Revision: D55998175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123866
Approved by: https://github.com/awgu
2024-04-12 18:29:32 +00:00
366b24e242 [Inductor] Add a device agnostic DeviceGuard class to inductor (#123338)
Summary: Currently although only in one place in inductor, the `device` context manager from the device interface is used . This PR creates an inductor specific `DeviceGuard` class for use in these cases, which keeps a reference to the `DeviceInterface` class which is defined and added out of tree. This then offloads the device specific work to the device interface, instead of having to define this logic on the device class which isn't strictly necessary for inductor.

Ideally I would have used the existing `DeviceGuard` class, but these are defined per device and don't work well with inductor's device agnostic/ out of tree compatible design. With the existing classes in mind, I am happy to take suggestions on the renaming of this class.

Whilst I was there, I also took the opportunity to rename `gpu_device` to `device_interface` to clarify this is not necessarily a GPU.

Test Plan: None currently, happy to add some.

Co-authored-by: Matthew Haddock <matthewha@graphcore.ai>
Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123338
Approved by: https://github.com/aakhundov
2024-04-12 18:21:27 +00:00
6367eab1a6 Normalize remote/local cache names (#123914)
Differential Revision: D56027380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123914
Approved by: https://github.com/aakhundov
2024-04-12 18:18:09 +00:00
23dbe2b517 Add test for skipping hf logging during export (#123410)
https://github.com/pytorch/pytorch/pull/123402 already supports hf
logging because HF logger is based on logging module

This PR adds a test to guard this against regression, only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123410
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-04-12 17:42:46 +00:00
3c09c6b91a Fix memory planning compile error (#123867)
Summary:
We should be using CppPrinter in the cpp wrapper codegen, not the ExprPrinter (which prints expressions for Python)

Not really a memory-planning-specific bug, but exposed by mem planning because it tends to emit more complicated expressions

Differential Revision: D56025683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123867
Approved by: https://github.com/hl475, https://github.com/chenyang78
2024-04-12 17:34:58 +00:00
ab647bd325 Add missing interfaces of torch.optim.swa_utils (#117036)
Add type hints for the function/class interfaces that appear in torch/optim/swa_utils.py but are missing in torch/optim/swa_utils.pyi.

- get_ema_multi_avg_fn
- get_swa_multi_avg_fn
- get_ema_avg_fn
- get_swa_avg_fn
- AveragedModel.__init__(multi_avg_fn)
- SWALR.get_lr

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117036
Approved by: https://github.com/janeyx99
2024-04-12 17:17:36 +00:00
5c0a380bdf [pt2e][qat] Support conv-transpose-bn[-relu] QAT fusion (#123652)
Summary: This commit adds support for QAT fusion for the
[conv-transpose-bn] and [conv-transpose-bn-relu] patterns.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_transpose_bn
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_transpose_bn_relu
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_transpose_bn
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_transpose_bn_relu

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/122224

Differential Revision: [D55930704](https://our.internmc.facebook.com/intern/diff/D55930704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123652
Approved by: https://github.com/jerryzh168
2024-04-12 17:16:02 +00:00
e4c887fbf6 [AOTAutograd] Replay views on output using FunctionalTensor metas. (#121007)
Fix: #120336

This PR fixes an issue on AOTAutograd, specifically on backends that don't support views
by themselves (e.g. XLA). Previously, AOTAutograd tried to reconstruct output views by
calling `as_strided` on the concrete bases using sizes and strides of the outputs that
aliased them. Since backends such as XLA doesn't support tensor aliasing, the sizes and
strides would be that of a contiguous tensor (not a view tensor). Because of that, calling
`as_strided` would error, since the output tensor would be bigger than its base. Instead,
this PR applies the sequence of `ViewMeta` gathered for each output during the
functionalization phase.

**Note:** we intentionally don't support base tensors that went through metadata mutation,
i.e. in-place view operations.

In summary, this PR:

- Introduces one `FunctionalTensorWrapper` member function alongside its Python APIs
    - `apply_view_metas(base)`: applies the `ViewMeta` sequence of the given instance onto
      another base
- Introduces a `OutputAliasInfo.functional_tensor` field
    - Saves the `FunctionalTensorWrapper` instance collected by the functionalization phase
    - Wraps it with a new `FunctionalTensorMetadataEq` class for comparing only the
      metadata of the tensors
- Plumbs `OutputAliasInfo.functional_tensor` to `gen_alias_from_base` function
    - Applies the `ViewMeta` sequence of the saved `FunctionalTensor` onto `aliased_base_tensor`
- Propagates `OutputAliasInfo.functional_tensor` when updating `fw_metadata`

(this PR description was updated in order to reflect the most recent changes)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121007
Approved by: https://github.com/bdhirsh
2024-04-12 16:54:13 +00:00
435db051d0 get torch.distributed.breakpoint() to work under Python/Meta contexts (#118645)
I noticed that when I put a `torch.distributed.breakpoint()` in [here](https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L605), it would fail. This fixes it.

In theory, it would probably be better to have a way to get the `barrier()` call to skip the dispatcher completely. I wasn't sure how to do that though, and this seems to cover 90% of issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118645
Approved by: https://github.com/yifuwang
2024-04-12 16:36:52 +00:00
07e0faf3ef [export] set AOTAutograd ctx to enable_grad with pre-dispatch export (#123671)
Summary:
Currently, torch.export (through AOTAutograd) compiles with a torch.no_grad() wrapper, which affects the presence of `set_grad_enabled` nodes in pre-dispatch export graphs. This changes the wrapper to nullcontext (i.e. enable grad) if `pre_dispatch=True`.

An example that previously failed without `with torch.no_grad()` is below:

```
class Model(torch.nn.Module):
    def forward(self, x, y):
        with torch.enable_grad():
            x = x + y
        return x

model = Model()
exported_program = torch.export._trace._export(
    model,
    (torch.tensor(2), torch.tensor(3)),
    dynamic_shapes=None,
    pre_dispatch=True,
    strict=False
)
```

The pass would inline the add call, but then try to construct a HOO subgraph with no inputs/outputs:
```
def forward(self):
    _set_grad_enabled_1 = torch._C._set_grad_enabled(False)
```

Test Plan: Test case checking that nullcontext & no_grad wrappers lead to different export behaviors (regarding set grad subgraphs).

Reviewed By: tugsbayasgalan

Differential Revision: D55777804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123671
Approved by: https://github.com/tugsbayasgalan
2024-04-12 16:16:23 +00:00
757daece95 [sparse] add meta support for add operation (and copy) (#123594)
This is a small step towards #117188
@pearu to review (this was split of #117907)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123594
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-04-12 15:50:30 +00:00
951582949b [export] Enforce final classes in serialization. (#123861)
Summary: as title, these are private API and not meant to be used across repos.

Test Plan: CI

Differential Revision: D56027954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123861
Approved by: https://github.com/tugsbayasgalan
2024-04-12 15:44:56 +00:00
2cb3301f80 [ROCm] Add cast to kFloat in amax calculation (#123872)
necessary cast to kFloat missed in previous amax PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123872
Approved by: https://github.com/drisspg
2024-04-12 15:38:41 +00:00
b024c0c2ef Convert MKL symbols from global to local (#122284)
PyTorch is statically linked to MKL but MKL symbols are visible to global, which may cause symbol conflicts.
Such error has been observed when a different version of MKL is dynamically linked to the other components: `libtorch_cpu.so` was invoked incorrectly when MKL descriptor was freed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122284
Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/ezyang
2024-04-12 15:36:47 +00:00
616446cc0a Update Kineto Hash to fix OSS json output (#123885)
Summary: Need to have temporary flag in Kineto so the correct JSON output is used. Will delete all temporary flags afterwards

Test Plan: Tested traces using updated hash. Values matched expected order of magnitude/general range that is expected.

Differential Revision: D56045866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123885
Approved by: https://github.com/aaronenyeshi
2024-04-12 14:49:57 +00:00
41613a0803 [Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556)
Summary:
1.Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 user.
2.Add PrivateUse1 key to ActivityType.
3. Support PrivateUse1 key in function deviceTypeFromActivity and _supported_activities.
4. Fix some bugs when processing profiler results.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120556
Approved by: https://github.com/aaronenyeshi
2024-04-12 14:28:19 +00:00
6f4c7eeb08 Migrate linux-focal-py3_11-clang10-build to ARC (#123441)
Migrate linux-focal-py3_11-clang10-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123441
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-04-12 14:23:12 +00:00
e023863474 Migrate linux-focal-py3_8-clang10-build to ARC (#123440)
Migrate linux-focal-py3_8-clang10-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123440
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-04-12 14:03:42 +00:00
1cdde98df4 Intel GPU oneDNN upstreaming for library compilation (#117098)
# Motivation

As proposed  in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, oneDNN library is an important component for Intel GPU software ecosystem.

This PR is intended to enable oneDNN compilation for Intel GPU.  It is the first step for we enabling any operators like `at::baddmm`.
With this PR, a static library `libdnnl.a` for GPU would be compiled in directory `/build/xpumkldnn_proj-prefix`.  It can be further linked to `libtorch_xpu.so` in future. The compilation would  depend on `USE_XPU` bool variables and runtime check like SYCL, which is defined in https://github.com/pytorch/pytorch/pull/116019 for runtime support. Once the #116019 merged, the compilation should be able to be triggered.

The modification is independent to oneDNN CPU compilation, hence no modification would be introduced for CPU Cmakefiles(e.g. FindMKLDNN.cmake)

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117098
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/atalman
2024-04-12 13:46:22 +00:00
3120dbbf81 Revert "[sparse] Add fast semi-structured spasification kernels (#122350)"
This reverts commit aaec97a40364bb6ccfd968f28d309cfff8748d20.

Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2051757450))
2024-04-12 13:26:10 +00:00
f9f7ef33c4 AOTAutograd: add config to error when overlapping input checks would cause slow compile / runtimes (#123455)
We should eventually make the non-overlapping checks faster when dynamic shapes are enabled, but this is pretty difficult to do. So for now this PR adds a config that lets us fail fast when this situation happens, instead of causing compile times to secretly come to a crawl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123455
Approved by: https://github.com/ezyang
2024-04-12 13:25:33 +00:00
f0eb162730 Revert "Switch quantized_decomposed over to new custom ops API (#123454)"
This reverts commit 638729c0cdf3ce4274f4d68f8e46e5a1cd36cbe8.

Reverted https://github.com/pytorch/pytorch/pull/123454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123454#issuecomment-2051738976))
2024-04-12 13:14:59 +00:00
7fc3aa5f81 [compiled autograd][aot] Trim runtime refs for list inputs from dynamo (#122535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122535
Approved by: https://github.com/bdhirsh
ghstack dependencies: #123630, #123674, #122353, #123359
2024-04-12 10:29:09 +00:00
540b451e91 [compiled autograd][dynamo] Codegen aliases to keep grad mutated tensors alive (#123359)
The current codegen is problematic if __compiled_fn_0 clears the inputs list, since we need it for assignment afterwards
```python
def forward(inputs):
    __compiled_fn_0 = ...  # The actual function needs to be provided
    graph_out_0 = __compiled_fn_0(inputs)  # clears inputs
    temp_list = []
    temp_list.append(graph_out_0[0])
    inputs[4].grad = graph_out_0[1]  # inputs is empty, index error
    inputs[7].grad = graph_out_0[2]
    inputs[8].grad = graph_out_0[3]
    inputs[9].grad = graph_out_0[3]
    del graph_out_0
    return temp_list
```

With this fix, we use aliases to keep the tensors alive
```python
def forward(inputs):
    __compiled_fn_0 = ...  # The actual function needs to be provided
    inputs_ref_1 = inputs[9]
    inputs_ref_2 = inputs[4]
    inputs_ref_3 = inputs[8]
    inputs_ref_4 = inputs[7]
    graph_out_0 = __compiled_fn_0(inputs)
    temp_list = []
    temp_list.append(graph_out_0[0])
    inputs_ref_2.grad = graph_out_0[1]
    inputs_ref_4.grad = graph_out_0[2]
    inputs_ref_3.grad = graph_out_0[3]
    inputs_ref_1.grad = graph_out_0[3]
    del graph_out_0
    return temp_list
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123359
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674, #122353
2024-04-12 10:29:09 +00:00
d274d57037 [compiled autograd][dynamo] Make compiled graph take in boxed inputs (#122353)
### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container.

And [Dynamo generates](fdc281f258/torch/_dynamo/codegen.py (L371)) the runtime function's signature using the graph's graphargs.

This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670).

```python
# original code
def forward(inputs):
  a, b, c, d, e = inputs
  inputs.clear()
  out = a
  out += b
  del b  # frees memory
  out += c
  del c  # frees memory
  out += d
  del d  # frees memory
  out += e
  del e  # frees memory
  return out

# compiled code:
def forward(a, b, c, d, e):
  # b, c, d, e can't be freed before end of function
```

This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.

### Solution

We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).

This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.

With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
  # a, b, c, d, e can be freed within the function now
```

Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](597f479643/torch/_inductor/compile_fx.py (L1454-L1478)), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122353
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674
2024-04-12 10:29:09 +00:00
9dfeec9cdc Add a mode to avoid clone() in DDPSink (#122927)
DDPSink clones the outputs of DDP to avoid in-place modification of loss (see https://github.com/pytorch/pytorch/issues/61982). However, when outputs are really large (2-3GB) this adds a lot of overhead for peak memory.

As a result, adding a mode to avoid this clone in cases where users are not modifying loss in-place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma
2024-04-12 08:56:10 +00:00
4e9094533e [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow user set a customized description for a process group, e.g. FSDP, PP.

Here are several use cases of user specified group_desc:
- Logging: we can easily match a log line and understand what's this collective/pg is used to.
- Pytorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis, benchmarks will be able to easily differentiate PG purpose like FSDP, PP.
- Lower layer collectives(e.g. NCCL) debug: we will be able to expose PG desc to NCCL communicator so NCCL layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-12 08:44:21 +00:00
73f0ecc1ac [BE] UFMT directory torch/_functorch (#123723)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123723
Approved by: https://github.com/Skylion007
2024-04-12 08:04:51 +00:00
7ff53e169f add option to turn on return_tuple in _SplitterBase (#123868)
Summary: as title. split the oss change from D55871896 into this separate diff

Test Plan: deploy

Differential Revision: D56032268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123868
Approved by: https://github.com/ZhengkaiZ, https://github.com/DanilBaibak
2024-04-12 07:50:28 +00:00
d994d993c0 Revert "[inductor] Fix fresh_inductor_cache() (#122661)"
This reverts commit cda383e7bcdac029a6d5508d63c0355a40bb0d32.

Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2051171028))
2024-04-12 07:26:50 +00:00
e881d567f4 Revert "[inductor] Write generated files from parent process (#123409)"
This reverts commit 79c565b24e6c305c09c8c908e27f4023f41dd567.

Reverted https://github.com/pytorch/pytorch/pull/123409 on behalf of https://github.com/DanilBaibak due to Needs to be reverted because it blocks reverting of the broken PR. ([comment](https://github.com/pytorch/pytorch/pull/123409#issuecomment-2051166617))
2024-04-12 07:23:57 +00:00
5669334175 Revert "Add Matmul recipe into x86_inductor_quantizer (#122776)"
This reverts commit e8e9261b906f69b397e4027362be801f98a68d62.

Reverted https://github.com/pytorch/pytorch/pull/122776 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122776#issuecomment-2051073373))
2024-04-12 06:29:27 +00:00
db895ace1d Only run backward part of COW test if results are strided (#123870)
Fixes #123792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123870
Approved by: https://github.com/ezyang
2024-04-12 04:43:02 +00:00
87f7486df9 Support SparseCsrPrivateUse1 (#123826)
As in the title, the changes support SparseCsr tensors working on PrivateUse1 devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123826
Approved by: https://github.com/ezyang
2024-04-12 04:13:35 +00:00
706f7d1f22 Enable UFMT on test/jit_hooks, test/lazy and some files (#123807)
Part of: #123062

Ran lintrunner on:

- `test/jit_hooks`
- `test/lazy`
- `test/linear.py`
- `test/load_torchscript_model.py`
- `test/mkl_verbose.py`
- `test/mkldnn_verbose.py`

with command:

```bash
lintrunner -a --take UFMT --all-files
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123807
Approved by: https://github.com/ezyang
2024-04-12 03:39:38 +00:00
4e3022dbe9 [dynamo][logs] Print bytecode before tracing (#123877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123877
Approved by: https://github.com/jansel
ghstack dependencies: #122943
2024-04-12 02:32:58 +00:00
ede9e8237a [dynamo] Bug fix for GET_YIELD_FROM_ITER (#122943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122943
Approved by: https://github.com/jansel
2024-04-12 02:32:58 +00:00
aaec97a403 [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-12 02:22:56 +00:00
868e5ced5d [Dispatcher] Collect autograd sequence numbers on PythonTLSSnapshot dispatch keys (#123304)
Fixes #121758

**TL;DR**: When profiling is turned on, the dispatcher will sometimes attach the autograd sequence number to the recorded profiler event. This PR expands the set of profiler events onto which we attach sequence numbers. Before, we'd only attach a sequence number if the current dispatch key was an Autograd dispatch key. Now, we attach a sequence number if the current dispatch key **set** contains Autograd.

**Context**:
The use case for this is torch.profiler for python subclasses.

Autograd attaches a "sequence number" to all ops that it encounters during the forward pass. Then, the corresponding sequence number can be associated with a backward kernel when backward is executed. This is used by the profiler to associate the forward ops to the backward ops; a forward op and a backward op with the same sequence number are "linked" in some post-processing step.

Prior to this PR, this profiler feature didn't work for python subclasses. The reason is that we don't collect profiler information for all the dispatches for a given kernel; we only dispatch the initial `call`, and not the subsequent `redispatch` invocations. Normally, an Autograd key (if we're running with autograd) is the highest dispatch key, so the initial `call` that we profile is an Autograd key, and we collect the sequence number. But when we're dealing with a python subclass, the first dispatch key is PythonTLSSnapshot, which eventually redispatches into Autograd. We don't record the Autograd sequence number in that case because we don't record redispatches.

To fix this, this PR collects a sequence number whenever the dispatch key **set** contains an Autograd key. That means we might sometimes collect multiple events with the same sequence number, or possibly attach sequence numbers when we won't actually use them? (e.g. maybe if the initial dispatch key handler removes Autograd for some reason). Although this might be a bit confusing for users looking directly at the sequence_nr directly, I think the main use case is for the profiler to create fwd-bwd links; and those should still be generated correctly in these cases.

Differential Revision: [D55724190](https://our.internmc.facebook.com/intern/diff/D55724190)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123304
Approved by: https://github.com/soulitzer
2024-04-12 02:01:15 +00:00
358ace1a1b functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599)
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.

This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.

This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.

To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`.

Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

Test plan:

```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
2024-04-12 01:48:49 +00:00
5b648afba4 Enable UFMT on test/test_multiprocessing (#123840)
part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123840
Approved by: https://github.com/ezyang
2024-04-12 01:21:54 +00:00
7efaf54dc4 Fakeifying views shouldnt create symbols when dynamic=False (#123348)
Fixes https://github.com/pytorch/pytorch/issues/123298

I was also seeing some crashes in torchtrain due to dynamic shapes, even when I set `compile(dynamic=False)` (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @wanchaol). This doesn't fix the underlying dynamic shape issues with compile + DTensor, but it does prevent dynamic shapes from leaking in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123348
Approved by: https://github.com/ezyang
ghstack dependencies: #122502, #122751
2024-04-12 01:12:23 +00:00
96fe3c5d46 fix correctness for dynamo inlining RangeVariable __contains__ (#122751)
Fixes https://github.com/pytorch/pytorch/issues/122379

It looks like `iter_contains()` in dynamo expects to take in something like `iter_contains(List[VariableTracker], VariableTracker])`. Previously, when we called this function where the list in question was a `RangeVariable`, we would pass in `RangeVariable.items` as our list.

This is wrong, though since `RangeVariable.items` just contains the underlying [start, stop, step]. It looks like `unpack_var_sequence` does the right thing of "materializing" the range into a list of `VariableTrackers`, so I used that instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122751
Approved by: https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #122502
2024-04-12 01:12:23 +00:00
2fe672b146 compile: ban mutations on non-compositional uses of as_strided (#122502)
Fixes https://github.com/pytorch/pytorch/issues/104505

I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor.

So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was **another** view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-12 01:12:23 +00:00
22ba180e55 [c10d] add more fields for periodic logging (#123860)
Summary:
Added the names of the last enquened, started and completed colletives,
in addition to their seq ID
Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123860
Approved by: https://github.com/XilunWu
2024-04-12 00:11:07 +00:00
78824fd212 [inductor] Fix recompiles bug for torch.full (#123811)
Fixes #123810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123811
Approved by: https://github.com/peterbell10
2024-04-12 00:07:47 +00:00
5b8c81eb82 [PT] [FSDP] fix HSDP sharding placement (#123778)
Summary:
https://github.com/pytorch/pytorch/pull/123230 formalized the contract for `ShardedTensor` sub group rank placement validation by making sure the placement rank is global rank, to align with general `torch.distributed` convention.

The current HSDP allows for both `ShardedTensor` and `DTensor`. While `DTensor` will eventually will replace `ShardedTensor`, its usage still exists and there's at least one test verifying the state dict with ST output.

This got broken as the test is run periodically only so it didn't block the other PR.
Fixes [#123749](https://github.com/pytorch/pytorch/issues/123749)

Test Plan: CI

Differential Revision: D55991256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123778
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-04-12 00:05:49 +00:00
7f6884f620 Added some extra repr to triton template buffers and added autotuned block configs to templated attention (#123813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123813
Approved by: https://github.com/drisspg, https://github.com/shunting314
ghstack dependencies: #123768
2024-04-11 23:57:47 +00:00
3d9dc976ae Handle unqualified imports in custom Triton kernels (#123703)
Summary: If in a custom (user-written) Triton kernel an externally imported symbol is used directly, we need to codegen the corresponding import outside the kernel body in the Python wrapper. E.g., if the user code has this:

```
    from triton.language.extra.cuda.libdevice import fast_dividef

    @triton.jit
    def my_kernel(...):
        ...
        x = fast_dividef(...)
        ...
```

The `from triton.language.extra.cuda.libdevice import fast_dividef` line needs to be carried over together with the `my_kernel` function. The PR adds this.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py
...
----------------------------------------------------------------------
Ran 464 tests in 113.512s

OK
```

Differential Revision: [D55953241](https://our.internmc.facebook.com/intern/diff/D55953241)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123703
Approved by: https://github.com/jansel, https://github.com/oulgen
2024-04-11 23:49:25 +00:00
604c9c5601 Enable UFMT on all of test/jit (#123623)
Partially addresses #123062

Ran lintrunner on:

- `test/jit`

with command:

```bash
lintrunner -a --take UFMT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123623
Approved by: https://github.com/ezyang
2024-04-11 23:45:05 +00:00
d0ccf599cc [export] Restore original placeholder names (part 2: higher-order-op subgraph naming) (#123587)
Summary:
note: breaking the original diff [D55225818](https://www.internalfb.com/diff/D55225818) into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.

Stacked PR to restore original names to placeholder nodes, replacing the default names arg0_1, arg1_1, ...

This PR propagates node names to higher-order-op subgraph placeholders, retaining the top-level names and handling naming collisions by suffixing other non-placeholder nodes in the subgraph with an index. This is the same handling as in fx.Graph/fx.Node, but implemented separately as a pass.

Since the input schemas of HOO subgraphs are very different, they are enumerated in _name_hoo_subgraph_placeholders(). Currently cond, map_impl, and wrap_with_set_grad_enabled are handled, but other ops can be easily added.

Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py

Differential Revision: D55456749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123587
Approved by: https://github.com/angelayi
2024-04-11 22:40:46 +00:00
b9675e820e [dynamo][cpp-guards] Improve the logs (#123780)
For this program

~~~
@torch.compile(backend="eager")
def fn(x, y, d):
    return x * y * d["foo"] * d["bar"]
~~~

Python logs are

~~~
V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1785] [0/0] [__guards] GUARDS:
V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d'], 8833952)                             # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.778000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] len(L['d']) == 2                                              # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] list(L['d'].keys()) == ['foo', 'bar']                         # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] hasattr(L['y'], '_dynamo_dynamic_indices') == False           # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d']['bar'], 8842592)                      # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] L['d']['bar'] == 2                                            # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] ___check_type_id(L['d']['foo'], 8842592)                      # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] L['d']['foo'] == 4                                            # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:450 in init_ambient_guards
V0410 15:48:57.779000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1])  # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:48:57.780000 140318524949632 torch/_dynamo/guards.py:1803] [0/0] [__guards] check_tensor(L['y'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1])  # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
~~~

CPP logs are

~~~
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1792] [0/0] [__guards] GUARDS:
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards]
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] TREE_GUARD_MANAGER:
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] +- RootGuardManager
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | +- DEFAULT_DEVICE: utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:450 in init_ambient_guards
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | +- GLOBAL_STATE: ___check_global_state()
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | +- DictSubclassGuardManager: source=L['d'], accessed_by=DictGetItemGuardAccessor(d)
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- KeyValueManager pair at index=0
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | +- KeyManager: GuardManager: source=list(L['d'].keys())[0]
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | | +- EQUALS_MATCH: list(L['d'].keys())[0] == 'foo'                               # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | +- ValueManager: GuardManager: source=L['d']['foo']
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | | +- EQUALS_MATCH: L['d']['foo'] == 4                                            # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- KeyValueManager pair at index=1
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | +- KeyManager: GuardManager: source=list(L['d'].keys())[1]
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | | +- EQUALS_MATCH: list(L['d'].keys())[1] == 'bar'                               # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | +- ValueManager: GuardManager: source=L['d']['bar']
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | | | +- EQUALS_MATCH: L['d']['bar'] == 2                                            # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | +- GuardManager: source=L['x'], accessed_by=DictGetItemGuardAccessor(x)
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- TENSOR_MATCH: check_tensor(L['x'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1])  # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- NO_HASATTR: hasattr(L['x'], '_dynamo_dynamic_indices') == False           # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- NO_TENSOR_ALIASING: check_no_aliasing(L['x'], L['y'])
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | +- GuardManager: source=L['y'], accessed_by=DictGetItemGuardAccessor(y)
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- TENSOR_MATCH: check_tensor(L['y'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[4], stride=[1])  # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- NO_HASATTR: hasattr(L['y'], '_dynamo_dynamic_indices') == False           # return x * y * d["foo"] * d["bar"]  # examples/ord_dicts.py:24 in fn
V0410 15:49:41.607000 140481927914624 torch/_dynamo/guards.py:1769] [0/0] [__guards] | | +- NO_TENSOR_ALIASING: check_no_aliasing(L['x'], L['y'])
~~~~

This info is also present in this gist for better viewing - https://gist.github.com/anijain2305/b418706e4ad4ec2d601530bc24cf8a20

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123780
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #123773, #123787
2024-04-11 22:23:28 +00:00
2e6871f924 [dynamo][guards-cpp] Early return in DictGuardManager for empty dicts (#123787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123787
Approved by: https://github.com/jansel
ghstack dependencies: #123773
2024-04-11 22:23:28 +00:00
b0b7aa201c [dynamo][cpp-guards] Introduce DictSubclassGuardManager (#123773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123773
Approved by: https://github.com/jansel
2024-04-11 22:23:28 +00:00
bd225189f1 [inductor] Change OverridesData to take callables instead of strings (#123397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123397
Approved by: https://github.com/lezcano
2024-04-11 22:22:54 +00:00
af1e03fb8f Quantized relu (#123004)
Summary: Add Quantized relu ops.

Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

Reviewed By: copyrightly

Differential Revision: D52344264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123004
Approved by: https://github.com/copyrightly
2024-04-11 21:55:25 +00:00
c75a32b9f9 [FSDP2] Fixed is_last_backward for 1f1b (#123857)
`FSDPState` only uses `TrainingState.PRE_BACKWARD` as a backward training state, not `TrainingState.POST_BACKWARD`, because the FSDP state itself does not run post-backward (only its `FSDPParamGroup`, which may not exist if the state does not manage any parameters).

This meant that when `is_last_backward=False`, the `FSDPState` was incorrectly still in `TrainingState.PRE_BACKWARD`, and the next `_pre_forward` would not run due to the early return logic for activation checkpointing:
7c451798cc/torch/distributed/_composable/fsdp/_fsdp_state.py (L148-L151)

We fix this by always transitioning to `TrainingState.IDLE` at the end of the current backward task, regardless of `is_last_backward`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123857
Approved by: https://github.com/weifengpy
2024-04-11 21:46:38 +00:00
13070e2753 [DCP] Adds better handling in logging of specific kwargs (#123658)
Adds additional signpost integrations to DCP Logger, to add support for MLU and metric collection.

Differential Revision: [D55803461](https://our.internmc.facebook.com/intern/diff/D55803461/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123658
Approved by: https://github.com/fegin
2024-04-11 21:09:38 +00:00
b7fac76fc2 [DCP fixes for _load_state_dict_keys and supports nested keys (#123679)
Fixes some issues with `_load_state_dict_keys`, including:
  * updates broken test, which was failing due to incorrect parameters
  * adds support for specifying nested keys e.g. (load state dict keys can now specify `something like "optimizer.state"`, which loads all keys under `optimzier.state`.
  * updates call site to use the private implementation of `_load_state_dict`, which properly handles empty state dicts (otherwise the keys are ignored)

Big shout out to @diego-urgell who not only identified current issues, but recommended the right solutions!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123679
Approved by: https://github.com/diego-urgell, https://github.com/wz337
2024-04-11 20:52:06 +00:00
e70bf23b7b [dynamo] apply certain bytecode cleaning transformations unconditionally (#123785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123785
Approved by: https://github.com/jansel
2024-04-11 20:25:21 +00:00
3cd06f56b1 [ez] test_profiler in serial (#123665)
Add test_profiler to the serial list since we keep needing to reopen disable issues and I think its due to being incompatible with parallelism
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123665
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-04-11 20:24:47 +00:00
fa013f69bb dynamo assertion that graph has no fake-tensor constants should check for subclasses (#118644)
This would have caught some of the nasty errors in https://github.com/pytorch/pytorch/pull/118191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118644
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519
ghstack dependencies: #118647
2024-04-11 20:10:15 +00:00
e979f45610 [while_loop] add a simiple op_info test (#123814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123814
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519
2024-04-11 19:59:04 +00:00
f82d20c207 [NEON] Remove implicit type promotion in Vectorized<c10::Half>::operator!= (#123864)
To make code compilable with `gcc`, which `clang` does not allow transparent type promotion between vectorized NEON types of the same sizes, see https://godbolt.org/z/xoasoGM81 as an example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123864
Approved by: https://github.com/malfet
2024-04-11 19:37:11 +00:00
5a7fd20aa1 [dynamo] Support autograd.FunctionCtx.needs_input_grad (#123700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123700
Approved by: https://github.com/anijain2305
2024-04-11 19:30:55 +00:00
016ca546aa Adding health check server hook in torch elastic (#122750) (#123504)
Summary:

Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled.

Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55837899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504
Approved by: https://github.com/kurman
2024-04-11 19:10:56 +00:00
ee869c9bb7 Avoid COW materialization in backward ops (4) (#123798)
Affected ops:
* embedding_bag
* mse_loss
* huber_loss
* grid_sample
* ctc_loss
* nll_loss
* pdist
* _segment_reduce

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123798
Approved by: https://github.com/ezyang
ghstack dependencies: #123797
2024-04-11 18:41:41 +00:00
69249a218b Avoid COW materialization in backward ops (3) (#123797)
Affected ops:
* conv ops
* glu
* prelu
* scaled_dot_product_attention
* threshold
* logsigmoid
* binary_cross_entropy
* gelu
* unfold
* smooth_l1_loss
* embedding

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123797
Approved by: https://github.com/ezyang
2024-04-11 18:35:08 +00:00
9c3b87833a AOTAutograd: keep set_() input mutations in the graph, ban other cases (#122981)
We have some (limited) support for `set_()` input mutations in `torch.compile`, but one restriction today is that we force them to run outside of the graph, in the opaque runtime epilogue.

This is a problem for ppFSDP. Why? The usage pattern of ppFSDP forward graphs look something like this:
```
def forward_fsdp(sacrificial_param, sharded_param, inp):
    allgathered_param = allgather(sharded_param)
    sacrificial_param.set_(allgathered_param)  # hidden in an autograd.Function that we trace
    out = matmul(sacrificial_param, inp)
    sacrificial_param.untyped_storage().resize_(0)
    return out
```
When we functionalize this graph, `sacrificial_param` sees two distinct types of input mutations, that we must preserve: a `set_`, and a `resize_`. Importantly, at runtime the `set_()` must run **before** the `resize_()`. Why? the `set_()` updates the storage of our sacrificial param to the allgather'd data, which allows the call to `sacrificial_param.resize_()` to free the allgathered data later. If we run the two mutations in reverse order, we will never free the allgathered data.

We want to put the `resize_()` mutation op inside of the graph (see next PR, also there's a much longer description in that PR for anyone interested). However, this will require us to put `set_()` in the graph as well, in order for them to run in the correct order.

In order to do this, I had to add some extra restrictions: You are now required to run `set_()` under `no_grad()` if you use it with `torch.compile`, and if you perform any other mutations to the input, those must be under no_grad as well (otherwise, the mutations may mutate the `grad_fn` of the input, making it no longer safe to keep in the graph). These restrictions are hopefully reasonable, since `set_()` doesn't see much usage today (and the original impetus for adding set_() support a few months ago was for fsdp anyway)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122981
Approved by: https://github.com/jansel
ghstack dependencies: #122433, #123646
2024-04-11 18:21:57 +00:00
10a03c56e5 fix leaky fake tensor on attribute assignment, support buffer assignment (#122337)
In non-strict, assignment of attributes in a model causes their state to contain fake tensors post-tracing, which leads to incorrect results on running the exported model. We now error when this happens, asking the user to use buffers instead.
Next, we add support for assignment of buffers. The final values of the buffers turn into outputs of the graph. Since the buffers are already lifted as inputs and populated with the initial values when the model is run, this leads to a simple programming model where the driver of the model can feed the outputs back as inputs for successive runs.

Differential Revision: D55146852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122337
Approved by: https://github.com/bdhirsh, https://github.com/tugsbayasgalan
2024-04-11 18:08:31 +00:00
7c451798cc [inductor] Disable channels_last heuristic when channels==1 (#123758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123758
Approved by: https://github.com/shunting314
2024-04-11 17:47:07 +00:00
79c565b24e [inductor] Write generated files from parent process (#123409)
Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file.  Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire
2024-04-11 17:39:16 +00:00
902374cc09 [CI] show doc coverage repro instructions (#123688)
remind devs they can reproduce the doc coverage error locally with following msg
```You can reproduce locally by running 'cd pytorch/docs && make coverage && cat build/coverage/python.txt'```

I spent 20min to figure out how to test locally so want to enrich the error msg
<img width="542" alt="Screenshot 2024-04-09 at 5 22 45 PM" src="https://github.com/pytorch/pytorch/assets/134637289/2c619d9d-74b5-4bda-8903-999ef5c255c2">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123688
Approved by: https://github.com/clee2000
2024-04-11 17:34:47 +00:00
6b24ec480c [Tensor] Detect more cases of symbolic sizes/strides (#123696)
Previously, we'd just check `has_symbolic_sizes_strides()` to know whether a tensor has symbolic sizes or strides; if does, we skip some profiler logic. But sometimes `has_symbolic_sizes_strides()` returns false, but we do actually have symbolic sizes or strides.

So in this change, we add `may_have_symbolic_sizes_strides()` - which should never return false if the tensor has symbolic sizes and strides

Why not change `has_symbolic_sizes_strides()`? It seems like there's preexisting logic that assumes that "if has_symbolic_sizes_strides(), then we can assume that this tensor is guaranteed to have symbolic sizes or strides". In this case, we have python-implemented sizes or strides, which should follow a different code path.

Differential Revision: [D55947660](https://our.internmc.facebook.com/intern/diff/D55947660/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123696
Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer
2024-04-11 16:51:52 +00:00
fe092da874 Revert "[quant] Enable backward for choose_qparams_per_token_asymmetric (#123452)"
This reverts commit c83900887f2fb5c7a04e7fd78ad8de7a20f356d4.

Reverted https://github.com/pytorch/pytorch/pull/123452 on behalf of https://github.com/clee2000 due to broke test_quantization.py::TestQuantizedTensor::test_decomposed_choose_qparams_per_token_asymmetric_backward on multiple jobs c83900887f https://github.com/pytorch/pytorch/actions/runs/8648781225/job/23714753103, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/123452#issuecomment-2050056601))
2024-04-11 16:19:28 +00:00
efa36ef092 Natively support int truncation, don't guard on positive/negative (#122827)
This doesn't entirely fix the original problem that prompted this, but
it seems to just be getting stuck in export constraint formatting now
which seems like progress to me.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827
Approved by: https://github.com/avikchaudhuri
2024-04-11 15:22:32 +00:00
c83900887f [quant] Enable backward for choose_qparams_per_token_asymmetric (#123452)
Summary: When running the backward for this op, we get the error:
```
RuntimeError: derivative for aten::aminmax is not implemented
```
This commit replaces this call with separate amin and amax
calls instead, which do have implemented derivatives.

Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward

Reviewers: jerryzh168, digantdesai

Subscribers: jerryzh168, digantdesai, supriyar

Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452
Approved by: https://github.com/digantdesai, https://github.com/jerryzh168
2024-04-11 14:51:42 +00:00
134e56fa33 inductor: log unique id to match output_code to aot graphs (#118647)
I found it helpful to be able to see, given some inductor output code, which AOT graph it came from. When you have large models with multiple graphs floating around this can be difficult, so I added the aot_config.aot_id to the printed inductor output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118647
Approved by: https://github.com/ezyang
2024-04-11 14:37:07 +00:00
638729c0cd Switch quantized_decomposed over to new custom ops API (#123454)
We are taking API feedback. Changes:
- I removed some of the default values (they weren't being used).
- I was unable to convert the last op (which is essentially an
  autograd.Function registered as CompositeImplicitAutograd). That one
  is "incorrectly registered"; I punt fixing it to the future.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123454
Approved by: https://github.com/andrewor14
ghstack dependencies: #123453, #123578
2024-04-11 13:18:06 +00:00
1b4419dc4d Refresh OpOverloadPacket if a new OpOverload gets added (#123578)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123578
Approved by: https://github.com/albanD
ghstack dependencies: #123453
2024-04-11 13:18:06 +00:00
8a5e7a01b5 [custom_op] Schema inference now includes default values (#123453)
If the function has default values, we should be able to do schema
inference and put the default values into the schema.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123453
Approved by: https://github.com/albanD
2024-04-11 13:18:02 +00:00
b2a0b8c446 Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-11 11:56:27 +00:00
4f244cfaa0 Enable int4mm for both half and bfloat16 (#123794)
By making performant kernels a template specialization. This is a prep change for enabling ARM+float16 fast int4 kernel
TODO:
 - Add float32 and some testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123794
Approved by: https://github.com/mikekgfb, https://github.com/jgong5
2024-04-11 11:40:56 +00:00
02b29e7d07 Add meta function for channel_shuffle operation (#123033)
This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups.

Fixes #122771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-04-11 10:07:18 +00:00
84580f76d9 fix flop counter issue with out parameters (#123768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123768
Approved by: https://github.com/zou3519
2024-04-11 09:39:53 +00:00
e8e9261b90 Add Matmul recipe into x86_inductor_quantizer (#122776)
**Summary**
Add `matmul` in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models. `matmul` recipe is disabled by default.

**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #122775
2024-04-11 09:32:47 +00:00
8798f5bf0d Add Quantization recipe filter per operator type for x86_inductor_quantizer (#122775)
**Summary**
Default recipes are enabled in `X86InductorQuantizer` and request comes to customize recipes based on these defaults.

- Avoid annotation propagation and restrict annotation only to annotate `conv`/`linear`.
- Add `matmul`  in the quantization recipes, noting that it's not a general recipe but tailored to meet accuracy criteria for specific models.

To meet these requests, we made changes in this PR by introducing interface as `set_function_type_qconfig` and `set_module_type_qconfig`

- `set_function_type_qconfig` accepts functional input as `torch.nn.functional.linear` or `torch.matmul`; `set_module_type_qconfig` accepts nn.Module input as `torch.nn.Conv2d`.
- To disable the recipe for this operator, user can simply exclude it from the list of operations as `quantizer.set_function_type_qconfig(op, None)`.
- To modify or extend the recipe for this operator with default recipe, user can customize as `quantizer.set_function_type_qconfig(op, config)`.

**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_conv2d_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_linear_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_maxpool2d_recipe
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122775
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-04-11 09:30:31 +00:00
6b7741546b Fixed arange decomp for float dtype (#121013)
## Description:

- [x] Fixed arange decomp for float dtype
- [x] Added a test

## Current state

Arange graph and C++ generated code are not optimal when arange is created directly using float32 dtype:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:8 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:9 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f64[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float64);  iota = None
        mul: "f64[10]" = torch.ops.aten.mul.Tensor(convert_element_type, 1);  convert_element_type = None
        add: "f64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type_1: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type_1, 10);  convert_element_type_1 = None
        return (add_1,)
```
and C++
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<double>(tmp0);   // <---- useless ops
            auto tmp2 = static_cast<double>(1.0);     // <----
            auto tmp3 = decltype(tmp1)(tmp1 * tmp2);  // <----
            auto tmp4 = static_cast<double>(0.0);     // <----
            auto tmp5 = decltype(tmp3)(tmp3 + tmp4);  // <----
            auto tmp6 = c10::convert<float>(tmp5);
            auto tmp7 = static_cast<float>(10.0);
            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
            out_ptr0[static_cast<long>(x0)] = tmp8;
        }
    }
}
```

However, if we manually create arange on i64 and then put to float32, generated graph and C++ code are more natural and benefit of a speed-up.
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s).to(dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on `main`:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:14 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:15 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:21 in func, code: a = torch.arange(s).to(dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(iota, torch.float32);  iota = None

        # File: check_arange_decomp.py:22 in func, code: return s + a
        add: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add,)
```

C++ on `main`
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

For example, the speed-up seen on upsample_nearest2d on cpu:
```
[----------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu ----------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                |  Eager (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+gitb4324ed) PR  |  Compiled (2.3.0a0+git0d1e705) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+git0d1e705) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |        287.988 (+-10.399)       |         200.034 (+-8.630)          |            285.143 (+-8.412)            |     1.425 (+-0.000)      |          287.991 (+-11.302)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |        697.206 (+-27.033)       |         171.650 (+-7.381)          |            193.280 (+-5.840)            |     1.126 (+-0.000)      |          701.642 (+-26.461)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        149.149 (+-6.045)        |         222.780 (+-6.852)          |            299.968 (+-12.354)           |     1.346 (+-0.000)      |          145.055 (+-7.232)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |        596.741 (+-27.970)       |         205.923 (+-8.648)          |            233.912 (+-7.742)            |     1.136 (+-0.000)      |          598.000 (+-25.630)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)      |       1095.734 (+-51.658)       |         700.850 (+-24.852)         |           1044.255 (+-38.216)           |     1.490 (+-0.000)      |         1097.977 (+-35.521)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)          |       2741.813 (+-122.917)      |         583.073 (+-16.998)         |            665.029 (+-36.331)           |     1.141 (+-0.000)      |         2722.388 (+-116.263)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (256, 256)    |        578.183 (+-37.266)       |         833.295 (+-42.264)         |           1131.341 (+-54.710)           |     1.358 (+-0.000)      |          584.953 (+-45.549)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (256, 256)        |       2332.508 (+-103.556)      |         840.194 (+-47.664)         |            935.625 (+-47.467)           |     1.114 (+-0.000)      |         2334.314 (+-91.644)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |        272.631 (+-11.348)       |         195.988 (+-5.748)          |            274.021 (+-9.475)            |     1.398 (+-0.000)      |          272.752 (+-12.716)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |        640.409 (+-25.465)       |         164.773 (+-7.372)          |            185.018 (+-8.349)            |     1.123 (+-0.000)      |          639.390 (+-30.761)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |        158.602 (+-6.593)        |         220.478 (+-6.809)          |            286.376 (+-8.981)            |     1.299 (+-0.000)      |          158.557 (+-6.143)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |        548.903 (+-22.889)       |         202.788 (+-9.158)          |            227.404 (+-8.995)            |     1.121 (+-0.000)      |          554.096 (+-21.330)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)    |       1036.061 (+-35.285)       |         680.728 (+-30.925)         |            986.254 (+-42.732)           |     1.449 (+-0.000)      |         1038.718 (+-43.070)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)        |       2504.520 (+-125.805)      |         550.067 (+-21.383)         |            628.000 (+-27.589)           |     1.142 (+-0.000)      |         2523.134 (+-113.336)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (200, 300)  |       1058.188 (+-57.853)       |        1216.427 (+-76.160)         |           1380.231 (+-98.939)           |     1.135 (+-0.000)      |         1057.031 (+-66.075)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (200, 300)      |       2305.911 (+-116.864)      |        1080.189 (+-79.934)         |           1141.561 (+-67.959)           |     1.057 (+-0.000)      |         2306.606 (+-121.544)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       1689.489 (+-60.579)       |        1077.401 (+-44.948)         |           1634.264 (+-64.340)           |     1.517 (+-0.000)      |         1693.945 (+-67.998)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |       4198.368 (+-179.096)      |         886.656 (+-30.355)         |           1028.568 (+-46.310)           |     1.160 (+-0.000)      |         4174.351 (+-141.020)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |        716.572 (+-51.954)       |        1175.864 (+-52.191)         |           1674.373 (+-51.815)           |     1.424 (+-0.000)      |          715.724 (+-41.104)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |       3604.989 (+-132.489)      |        1096.933 (+-54.290)         |           1270.347 (+-60.932)           |     1.158 (+-0.000)      |         3601.864 (+-140.218)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)      |       6721.610 (+-355.997)      |        4203.213 (+-134.362)        |           6423.763 (+-225.311)          |     1.528 (+-0.000)      |         6715.626 (+-288.233)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)          |      16695.467 (+-709.620)      |        3460.013 (+-149.456)        |           4001.810 (+-218.093)          |     1.157 (+-0.000)      |        16621.138 (+-713.320)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: nearest, align_corners: None, osize: (600, 700)    |       3020.017 (+-147.314)      |        4743.164 (+-135.850)        |           6709.494 (+-281.025)          |     1.415 (+-0.000)      |         3015.602 (+-105.852)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: nearest, align_corners: None, osize: (600, 700)        |      14456.688 (+-752.839)      |        5150.893 (+-201.571)        |           5737.315 (+-138.011)          |     1.114 (+-0.000)      |        14464.472 (+-720.027)

Times are in microseconds (us).
```

## PR

This PR improves arange decomp such that `arange(s, dtype=torch.float32)` removing extra dtype conversion to double:

Code:
```python
import torch

def func(x):
    s = x.shape[-1]
    a = torch.arange(s, dtype=torch.float32)
    return s + a

c_func = torch.compile(func)
out = c_func(torch.rand(10))
```

Graph on this PR:
```
 ===== Forward graph 0 =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:15 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:16 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)

 ===== AFTER POST GRAD =====
 /pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self):
        # File: check_arange_decomp.py:16 in func, code: a = torch.arange(s, dtype=torch.float32)
        iota: "i64[10]" = torch.ops.prims.iota.default(10, start = 0, step = 1, dtype = torch.int64, device = device(type='cpu'), requires_grad = False)
        mul: "i64[10]" = torch.ops.aten.mul.Tensor(iota, 1);  iota = None
        add: "i64[10]" = torch.ops.aten.add.Tensor(mul, 0);  mul = None
        convert_element_type: "f32[10]" = torch.ops.prims.convert_element_type.default(add, torch.float32);  add = None

        # File: check_arange_decomp.py:17 in func, code: return s + a
        add_1: "f32[10]" = torch.ops.aten.add.Tensor(convert_element_type, 10);  convert_element_type = None
        return (add_1,)
```
and C++ on this PR:
```c++
extern "C" void kernel(float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(10L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = c10::convert<long>(x0);
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = static_cast<float>(10.0);
            auto tmp3 = decltype(tmp1)(tmp1 + tmp2);
            out_ptr0[static_cast<long>(x0)] = tmp3;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121013
Approved by: https://github.com/peterbell10
2024-04-11 09:02:31 +00:00
2ac99d539b Only initialize state if needed in SGD (#123757)
Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757
Approved by: https://github.com/janeyx99
2024-04-11 08:56:06 +00:00
e00282fecf [c10d] make monitorThread sleep when we try to dump (#123788)
Summary:
We seperated the FR dump logic from the desync debug logic,
so we no longer set collectiveDebugInfoMode_ to true when we just need FR
dump. That's why monitor thread did not sleep and try to kill the
process without waiting for the dump.

The fix is simple, we should sleep whenever shouldDump_ is true
Test Plan:
Existing unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788
Approved by: https://github.com/wconstab
2024-04-11 07:10:46 +00:00
a510afb885 [aot] refactor runtime_wrapper's epilogue args access (#123674)
I want runtime_wrapper args to be stealable by call_func_at_runtime_with_args, since the args may contain activations which we don't want to hold alive in this scope.

The args to runtime_wrapper **should always be** from a list created within aot_autograd, so it **should always be** safe to steal them: a4a49f77b8/torch/_functorch/aot_autograd.py (L928-L932)

There are some accesses after we execute the compiled_fn, but those index accesses are already inferred at compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123674
Approved by: https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #123630
2024-04-11 07:07:50 +00:00
a8d2504eec [aot] always pass inputs to runtime_wrapper as list and add type annotations (#123630)
`runtime_wrapper` unpacking the arguments as a Tuple[arg] will prevent them from being freed within its scope. This is problematic if inductors wants to free those inputs, which could be activations in the compiled backwards case. This PR only changes the signature to pass as list, but does not clear it, keeping same refcount as before.

Also adding some mypy annotations. Ideally, instead of `Any`, I would want a type to describe single arg which seems to be usually Tensor or int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123630
Approved by: https://github.com/jansel, https://github.com/bdhirsh
2024-04-11 07:07:50 +00:00
c2f687f32c Option to include stride and device annotation in gm.print_readable() (#123690)
Summary:
Sample output for gm.print_readable(include_stride=True, include_device=True)

```
        getitem_21: "i32[1200][1]cuda:0" = auto_functionalized_4[1]
        copy_2: "f32[2, 60][60, 1]cuda:1"  = ....
```

Test Plan: CI

Differential Revision: D55949129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123690
Approved by: https://github.com/Chillee
2024-04-11 06:53:10 +00:00
8aad72b0d3 Support all unsigned int sizes on unique (#123643)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643
Approved by: https://github.com/albanD, https://github.com/kit1980
2024-04-11 06:50:12 +00:00
416f532753 [AOTI] Serialize large weights (#123002)
But appending them to the end of the shared library and mmaping afterwards
Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`

Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question

Added unites to validate that it works as expected
TODO:
  - Extend support to CUDA
  - munmap region if the same library is reused

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
2024-04-11 06:39:58 +00:00
7fc4b170d8 [EZ] Update mypy to 1.9.0 (#123595)
TODO:
 - Add linter that keeps `requirements-ci.txt` and `.lintrunner.toml` in sync
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123595
Approved by: https://github.com/kit1980
2024-04-11 06:36:09 +00:00
cacc8e27a5 [inductor][cpp] refactor code to use define_kernel and call_kernel similar to CUDA (#123704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123704
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-11 06:34:44 +00:00
2a597cfd2c [EZ] Pin scipy to 1.12 for Py-3.12 (#123795)
This caused false positive failures/reverts for https://github.com/pytorch/pytorch/pull/123689 and https://github.com/pytorch/pytorch/pull/123595

Fixes https://github.com/pytorch/pytorch/issues/123655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123795
Approved by: https://github.com/huydhn
2024-04-11 06:32:16 +00:00
57a2032c7a Delete Lark (#123689)
Now that we are using MLIR bindings inside triton, lets delete Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-11 05:51:06 +00:00
bbcdd28409 Report LRU cache stats at end of program for symbolic shapes (#123724)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123724
Approved by: https://github.com/Chillee
2024-04-11 05:12:43 +00:00
3ebbeb75fd [Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425) (#123650)
Summary:

Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
2024-04-11 04:29:20 +00:00
ec00daf4f1 [aotinductor] Fix benchmarks with self.autocast for run_performance_test (#123699)
## Pitch
Similar to https://github.com/pytorch/pytorch/pull/110490 which fixes the `self.autocast` in the `check_accuracy` function, this PR fixes the `self.autocast` context in the `run_performance_test` function.

## Description
The code inside `check_accuracy` after the fix on https://github.com/pytorch/pytorch/pull/110490:
a4a49f77b8/benchmarks/dynamo/common.py (L2490-L2500)

The current code on main branch before this PR in `run_performance_test` does not have the `self.autocast` context:
a4a49f77b8/benchmarks/dynamo/common.py (L2685-L2692)

For eager mode, the `model_iter_fn`  (which is actually [forward_pass](e8ad5460c0/benchmarks/dynamo/huggingface.py (L556-L558))) is used in [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2690))  and    [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L648)). The `forward_pass` has the `self.autocast` context thus it could run into BF16 when AMP is on. While for AOTInductor, we will call `export_aot_inductor` in both [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2695)) and [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L644-L646)), which doesn't have the `autocast` context thus will always run into FP32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123699
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-11 01:40:44 +00:00
902cb2c842 [multi_tensor_apply] revert the optimization introduced in #119764 (#123763)
The optimization causes regression in some torchbench benchmarks and
with some older versions of nvcc. The regression is preventable, but it
might require additional template specialization which would increase the
binary size.

Reverting it for now to re-evaluate.

Keeping the introduced tests and cuda-to-hip-mappings since these are
not specific to the optimization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123763
Approved by: https://github.com/janeyx99
2024-04-11 01:39:49 +00:00
0d0fd80033 [AOTI] fix relocation overflow error when .data is large (#123639)
https://github.com/pytorch/pytorch/pull/123164 removed the below code (so that constants are not readonly) to support module buffer mutation:
a9a9ce6d9c/torch/_inductor/codecache.py (L1685-L1691)

However, it may cause relocation overflow when the `.data` section is large.

Below is part of the output from `ld --versbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` is in between `.text` and `.bss`. When `.data` is too large, during the linking, the relocation of `.text` against `.bss` may overflow. Rename it to `.ldata` (perhaps that's why previously `.lrodata` instead of `.rodata` is used) so that it won't be in between the `.text` and `.bss` section

```
.text
.rodata
.data
.bss
.lrodata
.ldata
```

We met this issue when fixing https://github.com/pytorch/pytorch/issues/114450 and running the below models on CPU:
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BlenderbotForCausalLM
- DebertaV2ForMaskedLM
- DebertaV2ForQuestionAnswering
- XGLMForCausalLM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123639
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-11 01:37:43 +00:00
281810e307 Avoid COW materialization in backward ops (2) (#123740)
Affected ops:
* pooling ops
* relu
* pad
* interpolate
* upsample
* multi_margin_loss
* multilabel_margin_loss
* multilabel_soft_margin_loss

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123740
Approved by: https://github.com/ezyang
ghstack dependencies: #123657
2024-04-11 01:35:38 +00:00
793df52dc5 Aviod sync for privateuse1 backend inside Onehot. (#123621)
Onehot skip class value check for cuda and mps backend which avoid sync, and it's also needed for privateuse1. This PR add privateuse1 check to avoid sync too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123621
Approved by: https://github.com/ezyang
2024-04-11 01:19:53 +00:00
3e43dc086a implement bmm to support nested tensor and normal tensor (#119762)
implement bmm to support nested and normal tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119762
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-04-11 01:10:04 +00:00
b3feb01910 [dynamo] Update co_names if needed in fix_vars (#123697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123697
Approved by: https://github.com/williamwen42
2024-04-11 01:00:01 +00:00
c64184b097 [FSDP] Made patch functions thread safe with barrier (#123754)
I think if we do not have barriers as added in the PR, we could have a race condition with multi-threading (e.g. MTPG). I think this mainly matters if the test function itself does not run collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123754
Approved by: https://github.com/weifengpy
ghstack dependencies: #122962, #123290, #123362
2024-04-11 00:59:16 +00:00
4d1d71ecac Add aoti_torch_dtype<> specializations for half and bfloat16 (#123692)
Fixes #122989. (Note that while the missing symbol issue is fixed, the
test itself is still disabled, because the test runner now segfaults on
`atexit()`; but I think that issue is unrelated to the missing symbol.)

In addition to defining the specializations, I also `= delete`d the
default un-specialized version of `aoti_torch_dtype`, so future missing
dtype references will show up as compile-time instead of link-time
errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123692
Approved by: https://github.com/chenyang78
2024-04-11 00:16:05 +00:00
e29e990ddc Add the VecConvert between 8bits and float (#123512)
**Summary**
Fix the issue https://github.com/pytorch/pytorch/issues/123448 by adding intrinsic specialization between `int8/uint8` and `float32`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123512
Approved by: https://github.com/jgong5
2024-04-11 00:15:32 +00:00
01ab5a3104 aot_eager and aot_eager_decomp_partition: include input mutations in graph (#123646)
In the next PR I force `set_()` input mutation to require always been in the graph.

It's a lot easier to do this if we make our other debugging backends allow input mutations in the graph. Input mutations are relatively hardened at this point, so I'd rather just have our debugging backends consistently allow input mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123646
Approved by: https://github.com/ezyang
ghstack dependencies: #122433
2024-04-11 00:07:20 +00:00
8d36354187 AOTAutograd: fix detection for mutations under no_grad when there are outstanding aliases (#122433)
Fixes https://github.com/pytorch/pytorch/issues/122436.

The problem was that even though we were detecting when mutations happen under no_grad or not, we were recording these mutations when they happened to the FunctionalTensor - we should really just be recording them on the underlying storage.

In particular, what would happen is that we would mutate an alias under no_grad (marking the mutation as under no_grad properly), but if we use the base tensor outside of the no_grad region, we would lazily regenerate the base at this point, propagate the mutation to the base, and at that point mark that the base witnessed a mutation (outside of the no_grad region)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122433
Approved by: https://github.com/ezyang
2024-04-11 00:07:20 +00:00
57634ce74f Don't intersect when clamping for size oblivious (#123675)
Fixes https://github.com/pytorch/pytorch/issues/123651

Previously, when we performed a size oblivious test, we would only modify the lower bound, e.g., if we knew something had range `[0, 100]`, the size oblivious test would do `[2, 100]`. But what if your original range was `[0, 1]`? Naively intersecting this with `[2, sympy.oo]` would result in an empty set: that's a big no no. And in general, this intersection is kind of questionable: if your original range was `[0, 2]`, do we really want to assume that this quantity is exactly equal to 2 in the size oblivious test?

So here's an idea: when we're doing a size oblivious test, just forget about the max bound entirely. The idea is that the max bound probably wasn't actually helping you discharge the size oblivious test (because size oblivious tests are all about "well, if we can assume thing isn't zero or one, we know what the static value is.") So you can use the max bound OR you can use the size oblivious bound, but you're not allowed to use both at the same time. (It doesn't actually seem necessary to use the max bound, but it would be easy to permit this without using the size oblivious refinement.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123675
Approved by: https://github.com/PaulZhang12
2024-04-10 23:10:41 +00:00
b36b523c05 Fix guard_size_oblivious on non-symbolic expression (#123743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123743
Approved by: https://github.com/avikchaudhuri
2024-04-10 22:45:54 +00:00
a13cf5d396 Workaround dind-rootless bind mount permissions (#123641)
ARC uses dind-rootless which causes bind mounts to always be mounted as the "root" user inside the container rather than the "jenkins" user as expected. We run chown to ensure that the workspace gets mapped to the jenkins user as well as a trap to ensure this change gets reverted when the script ends for any reason. This is the same workaround as in #122922 but adapted for onnx tests.

Issue: pytorch/ci-infra#112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123641
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
2024-04-10 22:44:57 +00:00
69c6e0b851 [ROCm] Fix ROCm bug that causes numerical errors in float8_experimental (#123275)
Recently there has been work in an experimental repo to start implementing the intrinsics necessary handle F8 workloads. (see: https://github.com/pytorch-labs/float8_experimental)

A recent PR was submitted to add support for AMD F8 types (fnuz). This PR uncovered a bug in the rocm code that caused unit tests to fail due to numerical inaccuracy. This PR fixes that bug by swapping `abs_()` with `abs()` as the former performs elementwise absolute value on the tensor in-place causing the final assertion to fail due to the tensor only containing positive values.

Important to note, this fix is part of a workaround as hipblasLT does not yet support amax (HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER). This functionality has been implemented internally and is going through the proper channels to propagate to the community.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123275
Approved by: https://github.com/drisspg, https://github.com/jeffdaily
2024-04-10 21:52:02 +00:00
6b18daf205 Revert "Delete Lark (#123689)"
This reverts commit a631461eef7317efccf981989c5cf5c5b486ab0a.

Reverted https://github.com/pytorch/pytorch/pull/123689 on behalf of https://github.com/PaliC due to This PR seems to be breaking  test_binary_ufuncs.py ([comment](https://github.com/pytorch/pytorch/pull/123689#issuecomment-2048489549))
2024-04-10 21:48:04 +00:00
49d5553f5a Avoid COW materialization in backward ops (1) (#123657)
Affected ops:
* cdist
* sparse.sampled_addmm
* sparse.mm
* cross_entropy
* norm ops

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123657
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-04-10 21:07:07 +00:00
4bee4c7c25 [3.12] enable inductor unittests (#123654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123654
Approved by: https://github.com/jansel
2024-04-10 20:51:43 +00:00
f688d7a2f7 Only suggest debug envvar when debug is on (#123647)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123647
Approved by: https://github.com/Chillee
2024-04-10 20:41:39 +00:00
cda383e7bc [inductor] Fix fresh_inductor_cache() (#122661)
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.

Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
2024-04-10 20:38:56 +00:00
cf8139b956 Revert "Fix derived dim bugs in ep.run_decomp (#123326)"
This reverts commit 43228742820d8045a3980826f3ef85158dc9032c.

Reverted https://github.com/pytorch/pytorch/pull/123326 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123326#issuecomment-2048389042))
2024-04-10 20:35:01 +00:00
63c24f73ef Upsample2d backwards to int64_t (#123682)
Summary: To unblock training where upsamplenearest2d involves input or output tensors which are larger than 2^31. Comes up frequently in image & video applications.

Test Plan:
```
buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_upsamplingnearest2d_backward_64bit_indexing
```

Benchmarking (N5207417):
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
118.03993721008301 124.09385920000001 98.72685525972494
# after changes
118.05780944824218 124.10893509999994 98.71190944734577
```

Differential Revision: D55625666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123682
Approved by: https://github.com/ezyang
2024-04-10 20:26:08 +00:00
a631461eef Delete Lark (#123689)
Now that we are using MLIR bindings inside triton, lets delete Lark parser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123689
Approved by: https://github.com/jansel
2024-04-10 19:41:54 +00:00
8d9af8b91c Revert "[Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387)"
This reverts commit 82e0153487c2cd1abc92598963be5b57ab1948d4.

Reverted https://github.com/pytorch/pytorch/pull/122387 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122387#issuecomment-2048294643))
2024-04-10 19:34:26 +00:00
30c4efe6d2 Update preferred-citation in CITATION.cff (#123575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123575
Approved by: https://github.com/ezyang
2024-04-10 19:01:38 +00:00
2bcc83dfbd Preserve dispatch state across function tracing (#122073)
If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that.

Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code.  Fixed that to reset after as well.

Repro:

before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
======== 1 passed, 6173 deselected in 5.21s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
========= 1 skipped, 6172 deselected, 1 error in 5.29s =========
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 6173 deselected in 5.42s =====================
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64'
===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ======================
```
(note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073
Approved by: https://github.com/zou3519
2024-04-10 18:57:01 +00:00
a65e9a06f0 Revert "[AOTI] Serialize large weights (#123002)"
This reverts commit 27eb5daee494c42425392a327feff7b3e78c342c.

Reverted https://github.com/pytorch/pytorch/pull/123002 on behalf of https://github.com/DanilBaibak due to There is conflict to land the diff internally ([comment](https://github.com/pytorch/pytorch/pull/123002#issuecomment-2048215990))
2024-04-10 18:54:31 +00:00
4322874282 Fix derived dim bugs in ep.run_decomp (#123326)
Differential Revision: [D55730289](https://our.internmc.facebook.com/intern/diff/D55730289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123326
Approved by: https://github.com/avikchaudhuri
2024-04-10 18:54:03 +00:00
cd3c1132a9 Create a mock benchmark results for torchao cudagraphs_low_precision (#123419)
I copy the results of `cudagraph` backend as the mock data.  The new backend is called `cudagraphs_low_precision` and the new dtype is `quant`.

One gotcha thing is that GitHub has an upper limit of 10 inputs for a workflow dispatch. Before we can sort it out, I let torchao `cudagraphs_low_precision` mock data generated as part of cudagraphs.

If this works out ok, I'll need to work on another PR on test-infra to add the new `quant` dtype on the dashboard.

### Testing

Manually dispatch one round of training and inference benchmark with cudagraphs https://github.com/pytorch/test-infra/pull/5066, this should start populating the mock data into Rockset.

The dashboard change is at https://github.com/pytorch/test-infra/pull/5066.  The mock data shows on the preview at https://torchci-git-fork-huydhn-add-torchao-inducto-132c5a-fbopensource.vercel.app/benchmark/compilers?startTime=Fri%2C%2029%20Mar%202024%2020%3A00%3A48%20GMT&stopTime=Fri%2C%2005%20Apr%202024%2020%3A00%3A48%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=quant&lBranch=torchao-benchmark-template&lCommit=bc2ef535b412f84a9d071727fa6f0628b231fbd2&rBranch=torchao-benchmark-template&rCommit=bc2ef535b412f84a9d071727fa6f0628b231fbd2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123419
Approved by: https://github.com/ZainRizvi, https://github.com/desertfire
2024-04-10 18:51:21 +00:00
c3de2cc154 Enable UFMT on test/test_foreach.py (#123718)
Part of https://github.com/pytorch/pytorch/issues/123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123718
Approved by: https://github.com/ezyang
2024-04-10 18:22:12 +00:00
c9c099b271 Add kwargs to RecordFunctionFast (#123600)
Differential Revision: [D55897888](https://our.internmc.facebook.com/intern/diff/D55897888/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123600
Approved by: https://github.com/davidberard98
2024-04-10 18:17:50 +00:00
26a9b05bce Set stacklevel on checkpoint warning (#123717)
Partially addresses https://github.com/pytorch/pytorch/issues/123626

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123717
Approved by: https://github.com/Skylion007
2024-04-10 17:25:06 +00:00
66e61af467 [ROCm][CI] skip float16 for TestTemplatedSDPA (#123668)
Fixes #123531 and #123610 by replacing DISABLED issues with code skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123668
Approved by: https://github.com/atalman
2024-04-10 17:09:22 +00:00
49be96efe8 Instantiate VaryingShape<c10::Stride> (#123542)
Fixes #123248

As the ISSUE stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123542
Approved by: https://github.com/ezyang
2024-04-10 16:35:59 +00:00
7a925c2657 Migrate linux-focal-py3_8-clang10-onnx-build to ARC (#123435)
Migrate linux-focal-py3_8-clang10-onnx-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123435
Approved by: https://github.com/zxiiro, https://github.com/atalman
2024-04-10 16:18:47 +00:00
d017645dc7 Revert "Support all unsigned int sizes on unique (#123643)"
This reverts commit 8aa08b8b9d1fab2a13dc5fbda74c553cb2a08729.

Reverted https://github.com/pytorch/pytorch/pull/123643 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing lots of jobs with the new dtype 8aa08b8b9d ([comment](https://github.com/pytorch/pytorch/pull/123643#issuecomment-2047905094))
2024-04-10 15:49:40 +00:00
5c1bde99c0 Fix the uncorrect return value of Tensor.numpy() (#123538)
Fixes #123494

As the ISSUE stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123538
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-04-10 14:47:24 +00:00
bdee35a870 [BE] rewrite logical-and expression to if-statement (#123638)
```diff
- push and self.push(value)
+ if push:
+     self.push(value)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123638
Approved by: https://github.com/ezyang
2024-04-10 14:34:17 +00:00
8aa08b8b9d Support all unsigned int sizes on unique (#123643)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643
Approved by: https://github.com/albanD, https://github.com/kit1980
2024-04-10 11:46:10 +00:00
4a4baff0f3 [dynamo, 3.12] force LOAD_SUPER_ATTR second bit on (#123686)
This was pretty painful to find haha

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123686
Approved by: https://github.com/jansel
2024-04-10 10:31:46 +00:00
d60135e915 [FSDP1] fix _same_storage check for DTensor (#123617)
for FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However,  ``DTensor.untyped_storage().data_ptr()`` does not work in ``_same_storage``. Thus desugar to ``DTensor._local_tensor.untyped_storage().data_ptr()`` https://github.com/pytorch/pytorch/issues/123272

credit to @bigning for the original fix. after landing, we would not need patching in mosaic composer https://github.com/mosaicml/composer/pull/3175/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123617
Approved by: https://github.com/awgu
2024-04-10 10:26:12 +00:00
37fd547518 [DeviceMesh] Make dtype of mesh tensor from init_device_mesh() consistent with directly calling DeviceMesh() (#123677)
Currently, mesh tensor from `init_device_mesh()` has a dtype of `torch.int64` while mesh tensor from `DeviceMesh()` would have dtype of `torch.int32`. Making them consistent in this PR.

DeviceMesh ctor dtype pointer:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/device_mesh.py#L217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123677
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol
2024-04-10 09:14:34 +00:00
1346ebf12e [dynamo][guards] Delay DUPLICATE_INPUT guard because of incorrect ordering (#123605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123605
Approved by: https://github.com/jansel
ghstack dependencies: #123606
2024-04-10 07:30:02 +00:00
1dc4e1e335 [dynamo][logs] Bug fix (#123606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123606
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-04-10 07:30:02 +00:00
cdc47ad991 fix amp for AOTInductor (#122883)
## Pitch
This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515.

## Description
When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)) which is called when registering SDPA patterns.

The `inference_compiler` ([fw_compiler_base](1d52c2d985/torch/_inductor/compile_fx.py (L1234))) will call into [_recursive_joint_graph_passe](1d52c2d985/torch/_inductor/compile_fx.py (L1241)) and then [_sfdp_init](1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)).

When testing accuracy, we'll set [inductor_config.fallback_random = True](1d52c2d985/benchmarks/dynamo/common.py (L3526)), which will make the `search_fn` to be `None` [here](1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)), thus the pattern will be generated runtime [here](1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail.

Inductor path disables amp inside [aot_dispatch_base](1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue.

## UT
For the added UT, there's one case
`python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now.
```
RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-10 06:20:17 +00:00
fe4d1aff05 UFMT formatting on test/export (#123520)
Partially addresses https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
test/export

Detail:
```Shell
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123520
Approved by: https://github.com/ezyang
2024-04-10 05:38:42 +00:00
e8ad5460c0 Fix skip logic bug in dynamo benchmark runner (#123544)
Fix huggingface and timms_model did not uses TorchBenchmarksRunner class issue.
![image](https://github.com/pytorch/pytorch/assets/84730719/358eed37-4d70-4034-85f9-58a922b5c532)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123544
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire
2024-04-10 05:14:31 +00:00
65710d95c9 Fix example in torch.distributed.new_subgroups docstring (#123492)
Summary: As title

Test Plan: Run the example locally

Reviewed By: zhaojuanmao

Differential Revision: D55617871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123492
Approved by: https://github.com/wconstab, https://github.com/wz337
2024-04-10 03:33:07 +00:00
713f065c8d Enable UFMT on test/test_dispatch (#123644)
Part of https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
test/test_dispatch.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123644
Approved by: https://github.com/ezyang
2024-04-10 03:09:38 +00:00
eqy
e15ae63a42 [cuBLAS][cuBLASLt] Remove CUDA 11 heuristics for dispatching to cuBLASLt (#119939)
Revisiting an old workaround to see if things have improved since then...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119939
Approved by: https://github.com/atalman
2024-04-10 03:01:59 +00:00
247646333e Fix py opcode (#118977)
Added a C file that includes the symbols  _PyOpcode_Deopt and _PyOpcode_Caches since they are not available in the python lib but they are available on Linux in order to fix linking issues in Windows in python 3.11.
Fixes #93854

Test by running on python 3.11 `python test/functorch/test_dims.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118977
Approved by: https://github.com/ezyang
2024-04-10 02:39:17 +00:00
b7f898c4a6 Generalize host allocator to be device-agnostic (#123079)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we would like to generalize device and host allocator to be device-agnostic. We prioritize the host allocator as it is simpler and more native than the device allocator. In this PR, we intend to refactor the host allocator to make it be shared across different backends. In 2nd PR, we will support host allocator on XPU backend.

# Design
The previous design:
- `CUDAHostAllocatorWrapper` inherits from `c10::Allocator`, and `CUDAHostAllocator` is an implementation of `CUDAHostAllocatorWrapper`.

The design in this PR:
- `CachingHostAllocatorImpl` is an interface that implements the caching host allocator logic that can be sharable across each backend.
- `CachingHostAllocatorInterface` inherits from `c10::Allocator` as an interface and accepts `CachingHostAllocatorImpl` as its implementation.
- `CUDACachingHostAllocator` is a CUDA host allocator whose implementation is `CUDACachingHostAllocatorImpl` which is specialized from `CachingHostAllocatorImpl`.

This design can
- share most code of caching mechanism across different backends, and
- keep the flexibility to expand its exclusive feature on each backend.

# Additional Context
In addition, we will continue to generalize the device allocator in the next stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123079
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD, https://github.com/gujinghui
2024-04-10 02:38:07 +00:00
82e0153487 [Quant][PT2E] Enable linear-binary(-unary) post-op recipe for X86Inductor quantizer (#122387)
As the title
**Test plan**
python test/test_quantization.py -k test_linear_binary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122387
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-04-10 01:34:14 +00:00
69c7bd4587 [Compile FSDP2][3/n] Check all_gather_work is distributed_c10d.Work before calling .wait() (#123491)
In FSDP2, we have this:
```python
if all_gather_work is not None:  # async op
    all_gather_work.wait()
```

In eager, there are only two possible values for `all_gather_work`:
1. `distributed_c10d.Work` object (when `async_op=True`)
2. `None` (when `async_op=False`)

So the existing `if` statement is sufficient for eager mode.

In compile, there is one additional possible value for `all_gather_work` which is `FakeTensor` object (not None), because we return regular tensor for collective call in compile mode. If we use the existing `if` statement as-is, we will always call `.wait()` on `all_gather_work`, which is not the same semantics as eager.

There are a few ways to fix this:
Option 1: Properly support `distributed_c10d.Work` in Dynamo. This is the best long-term fix but it will take much more time to make it work.

Option 2: Allow calling `.wait()` on FakeTensor in compile mode (and just return None there) - this seems hacky because FakeTensor wouldn't normally have this method.

Option 3: Check whether `all_gather_work` is `distributed_c10d.Work` before calling `.wait()` on it. **<-- This PR**

Option 3 is chosen in this PR because it seems to also make the eager program semantics clearer (we don't need to think about whether `all_gather_work` can be `.wait()` on in all scenarios, as long as we know `distributed_c10d.Work` can be waited on).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123491
Approved by: https://github.com/awgu
2024-04-10 01:23:54 +00:00
7bcac56140 Update Kineto Submodule in PyTorch (#123565)
Summary: Update the Kineto Submodule in PyTorch third_party/kineto.

Test Plan: GitHub CI

Differential Revision: D55875347

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123565
Approved by: https://github.com/sraikund16
2024-04-10 01:07:42 +00:00
bd59e1113d Improve docstring for tensorboard add_embedding() (#120408)
Fixes missing parameter documentation (`metadata_header`).
Fixes a typo.
Adds a note explaining a somewhat confusing behavior of Tensorboard Projector where categorical values with more than 50 unique values are not permitted to be used for coloring. This was not documented anywhere. The confusion caused https://github.com/tensorflow/tensorboard/issues/61.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120408
Approved by: https://github.com/albanD
2024-04-10 00:32:29 +00:00
0288fa7cae [inductor][cpp] expose config options via env vars (#123519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123519
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-04-10 00:11:32 +00:00
786c6db519 Revert "UFMT formatting on test/export (#123520)"
This reverts commit ec7551d1b783e284cedddeb9aeabb285e653c480.

Reverted https://github.com/pytorch/pytorch/pull/123520 on behalf of https://github.com/PaliC due to lint is still broken ([comment](https://github.com/pytorch/pytorch/pull/123520#issuecomment-2046223260))
2024-04-10 00:06:30 +00:00
eqy
624e58f2c6 [CUDA] Update size_1 conv tests with TF32 thresholds (#118022)
Seeing some numerical mismatches on A100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118022
Approved by: https://github.com/atalman
2024-04-09 23:49:40 +00:00
af27bc443b fix typo in 4 files (#123529)
fix typo: `information` has no plural.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123529
Approved by: https://github.com/albanD
2024-04-09 23:37:35 +00:00
8a566161cd [ATen-VK] Remove duplicate function from Resource.cpp (#123659)
Summary:
We want to bundle both ATen-VK and ET-VK in one library. There's a lot of copied code between the two libraries and most of it's fine since it is guarded by different namespaces. This function is the one exception and so we delete it from ATen-VK.

```
Action failed: fbsource//xplat/wearable/wrist/ml:wristmlcore (cxx_link libwristmlcore.so)
Local command returned non-zero exit code 1
Reproduce locally: `env -- 'BUCK_SCRATCH_PATH=buck-out/v2/tmp/fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristm ...<omitted>... /fbsource/c81367d319075390/xplat/wearable/wrist/ml/__wristmlcore__/libwristmlcore.so.linker.argsfile (run `buck2 log what-failed` to get the full command)`
stdout:
stderr:
ld.lld: error: duplicate symbol: operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics)
>>> defined at Resource.cpp:6 (xplat/caffe2/aten/src/ATen/native/vulkan/api/Resource.cpp:6)
>>>            Resource.cpp.pic.o:(operator<<(std::__ndk1::basic_ostream<char, std::__ndk1::char_traits<char>>&, VmaTotalStatistics)) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/caffe2/__torch_vulkan_api__/libtorch_vulkan_api.pic.a
>>> defined at Resource.cpp:14 (xplat/executorch/backends/vulkan/runtime/api/Resource.cpp:14)
>>>            Resource.cpp.pic.o:(.text._ZlsRNSt6__ndk113basic_ostreamIcNS_11char_traitsIcEEEE18VmaTotalStatistics+0x1) in archive buck-out/v2/gen/fbsource/c81367d319075390/xplat/executorch/backends/vulkan/__vulkan_compute_api__/libvulkan_compute_api.pic.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Buck UI: https://www.internalfb.com/buck2/fc1cf878-690d-48ab-acdb-ece2c48dab42
Network: Up: 43MiB  Down: 391MiB  (reSessionID-830cf8b1-c9c8-474a-b8ed-45c37fceb21b)
Jobs completed: 9227. Time elapsed: 1:31.3s.
Cache hits: 22%. Commands: 5665 (cached: 1261, remote: 4002, local: 402)
BUILD FAILED
Failed to build 'fbsource//xplat/wearable/wrist/ml:wristmlcore (ovr_config//platform/android:arm32-clang-r21e-api29-opt-malibu#c81367d319075390)'
```

Test Plan:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin
```

Reviewed By: copyrightly

Differential Revision: D55926906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123659
Approved by: https://github.com/SS-JIA
2024-04-09 23:28:16 +00:00
ec7551d1b7 UFMT formatting on test/export (#123520)
Partially addresses https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
test/export

Detail:
```Shell
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123520
Approved by: https://github.com/shink, https://github.com/ezyang
2024-04-09 23:24:13 +00:00
c773913407 Add torch.while_loop support to AOT Inductor (#123586)
Summary: Previously, `torch.while_loop` was supported only in JIT inductor (added in https://github.com/pytorch/pytorch/pull/122069). Here we extend the support to AOT Inductor.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_while_loop
...
----------------------------------------------------------------------
Ran 24 tests in 129.236s

OK (skipped=8)

$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 50 tests in 136.199s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123586
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-04-09 22:53:10 +00:00
3908ebca86 Test COW materialization in backward ops (#123593)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123593
Approved by: https://github.com/ezyang
2024-04-09 22:31:50 +00:00
298171df5c [benchmark] Add namedtuple pytree serialization (#123648)
Fixes https://github.com/pytorch/pytorch/pull/123388#issuecomment-2045289729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123648
Approved by: https://github.com/desertfire
2024-04-09 22:25:36 +00:00
60d7fbe89a Register matmul out variant so it is used (#122979)
Fixes https://github.com/pytorch/pytorch/issues/122774

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122979
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2024-04-09 22:21:37 +00:00
27eb5daee4 [AOTI] Serialize large weights (#123002)
But appending them to the end of the shared library and mmaping afterwards
Disabled by default, but overridable by `config._force_mmap_aoti_weights`

Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question

Added unites to validate that it works as expected
TODO:
  - Extend support to CUDA
  - munmap region if the same library is reused

Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
2024-04-09 22:18:57 +00:00
d6fb1da806 Fix doc example of masked_scatter (#123664)
mask has to be a bool tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123664
Approved by: https://github.com/peterbell10, https://github.com/albanD
2024-04-09 22:15:12 +00:00
adcfc2b582 Add meta reg for addcdiv/addcmul ScalarList (#123486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123486
Approved by: https://github.com/awgu
2024-04-09 22:05:58 +00:00
b287dbbc24 [export] Fix naming if state dict contains colons (#123601)
Test Plan:
buck2 run mode/opt //aps_models/pyper/ads:train\[inplace\] +training.ir_serializer=on_disk

https://www.internalfb.com/intern/everpaste/?handle=GICWmAB0g_Z1StMCAMxuhJI6U9pHbsIXAAAz

Reviewed By: tugsbayasgalan

Differential Revision: D55894742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123601
Approved by: https://github.com/pianpwk
2024-04-09 21:25:08 +00:00
a96e4ad0d1 [Inductor] Pass device interface to the worker compile (#122492)
Summary: In `codecache.py` pass the device_interface directly to `_worker_compile()` instead of calling `get_device_interface()` from inside the function.

If the device_interface is registered by an out-of-tree module then it will only be registered inside the main process and not inside the worker process. This fixes this issue. Happy to add a test if required.

Test plan:
No tests added

Co-authored-by: brothergomez <brothergomez@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122492
Approved by: https://github.com/ezyang
2024-04-09 21:23:33 +00:00
f7cdc1b9bb Add test_aot_inductor to test_inductor (#123340)
AOTI changes have been breaking for ROCm on trunk because we do not have testing of AOTI in inductor/pull/trunk workflow for ROCm. This PR adds `test_aot_inductor` to inductor workflow to catch such issues.

More context here: https://github.com/pytorch/pytorch/pull/123164#issuecomment-2033494012

Runtime increase for inductor workflow:
CUDA:
PR corresponding to base commit used for this PR: [100 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23415210028?pr=123290)
This PR: [183 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23465530389?pr=123340)

ROCM:
PR corresponding to base commit used for this PR: [105 mins](https://github.com/pytorch/pytorch/actions/runs/8545475047/job/23416422145?pr=123290)
This PR: [148 mins](https://github.com/pytorch/pytorch/actions/runs/8562003098/job/23466516866?pr=123340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123340
Approved by: https://github.com/atalman, https://github.com/desertfire
2024-04-09 21:22:11 +00:00
bff321716c Remove special handling of step with closure (#123620)
Implements https://github.com/pytorch/pytorch/issues/123479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123620
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551, #123552, #123618
2024-04-09 21:15:24 +00:00
3db618d656 [CUDA] Use 64-bit indexing in CUDA_KERNEL_LOOP in im2col (#118005)
#117736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118005
Approved by: https://github.com/atalman
2024-04-09 21:04:20 +00:00
b3eb1b2f74 Revert "fix amp for AOTInductor (#122883)"
This reverts commit a4a49f77b8c45ea459263c2242ab391b3d0577f2.

Reverted https://github.com/pytorch/pytorch/pull/122883 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122883#issuecomment-2046026363))
2024-04-09 20:51:53 +00:00
041be901b3 fix ctc_loss zero-length/neg-length corner cases (#123193)
Fixes #84827, fixes #86596, fixes #88047, fixes #89208.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123193
Approved by: https://github.com/mikaylagawarecki
2024-04-09 20:39:39 +00:00
a6080f79e9 [Build] Add linker script optimization (#121975)
This PR adds a linker script optimization based on prioritized symbols that can be extracted from the profiles of popular workloads. The present linker script was generated to target ARM+CUDA and later can be extended if necessary. The reason we target ARM is shown below:

> PyTorch and other applications that access more than 24x 2MB code regions in quick succession can result in performance bottlenecks in the CPU front-end.  The link-time optimization improves executable code locality and improve performance. We recommend turning on the optimization always for PyTorch and other application that behaves similarly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121975
Approved by: https://github.com/ptrblck, https://github.com/atalman
2024-04-09 20:22:25 +00:00
178ce1433c Hoist out auxiliary values in optional-typed arguments (#123613)
This fixes #123176, and partially addresses #121814 too. #123176 uses an
optional device arg while #121814 uses an optional list arg.

For optional arguments that have auxiliary info -- specifically, tuples
/ lists with their length parameter, and device types with their device
index -- we need to hoist out the extra argument. E.g. when passing a
device with ID 1, we want to emit

```
auto var_0 = cached_torch_device_type_cpu;
aoti_torch_foo(..., &var_0, 1);
```

instead of the (syntactically incorrect)

```
auto var_0 = cached_torch_device_type_cpu,1;
aoti_torch_foo(..., &var_0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123613
Approved by: https://github.com/desertfire
2024-04-09 20:17:35 +00:00
1970a802b3 Only print bw result for the first time we benchmark a kernel (#123568)
Summary:
As title.

Before this change, we use the benchmark result saved as cache and print out every time we call a kernel. The information is the same. Let's just print out at the first iteration.

Test Plan: Local test.

Differential Revision: D55878382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123568
Approved by: https://github.com/jackiexu1992
2024-04-09 19:53:57 +00:00
5712c326a5 Teach pattern_matcher to use a pre-traced pattern if given (#121314)
The check_fn portion of pattern_matcher was retracing the pattern even if a pre-traced pattern was provided.
I think that as long as the patterns don't have control flow based on their inputs then this should be safe.

For this benchmark
```
python benchmarks/dynamo/huggingface.py --training --amp --performance --only MobileBertForQuestionAnswering --backend=inductor
```
this improves the performance of `joint_graph_passes` from about 9s down to 3s.

In the performance dashboard it seems to be a small win - most of the compilation times dropped by a couple seconds:
Torchbench 126s -> 124s
Huggingface 114s -> 110s
TIMM models 209s -> 208s
Dynamic 44s -> 43s
Blueberries 84s -> 81s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121314
Approved by: https://github.com/eellison
ghstack dependencies: #121313
2024-04-09 19:42:19 +00:00
4044e93a51 Add mm_pattern and bmm_pattern to serialized_patterns (#121313)
Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()` which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled.

To precompile patterns (and save to disk) run:
```
torchgen/fuse_attention_patterns/gen_attention_patterns.py
```

- Updated the sfdp patterns to use `gen_register_replacement`.
- Add serialized patterns for mm_pattern and bmm_pattern (The 'misc' patterns don't serialize cleanly so can't be added).
- Updated the testing so it checked the round-trip patterns match and not just that it serialized the same way.
- Checking that the patterns round-trip properly found that the `users` field wasn't being serialized properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313
Approved by: https://github.com/eellison
2024-04-09 19:42:19 +00:00
f772ea5493 Improve return value docs for Module.load_state_dict (#123637)
Sorry to add this to your plate but I hope it helps. I find it's ambiguous what "missing keys" and "unexpected keys" are, and the documentation does not add clarity. Today I realized I've been double-guessing myself on this for years.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123637
Approved by: https://github.com/mikaylagawarecki
2024-04-09 19:39:31 +00:00
10d06fc92e Revert "[EZ] Update mypy to 1.9.0 (#123595)"
This reverts commit f61b04a1f0ed55aa3f9b75e1266e7a5dc71fc90d.

Reverted https://github.com/pytorch/pytorch/pull/123595 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/123595#issuecomment-2045865407))
2024-04-09 18:53:55 +00:00
1b5944358e Ignore logging.Logger.* calls during dynamo export (#123402)
Follow up for https://github.com/pytorch/pytorch/pull/123368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402
Approved by: https://github.com/williamwen42
2024-04-09 18:51:00 +00:00
2a37793249 [Dynamo] Ensure that Higher Order Ops can be composed in dynamo (#123357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123357
Approved by: https://github.com/zou3519
ghstack dependencies: #122211
2024-04-09 18:50:17 +00:00
497bac223c Add XPU backend check on NamedTensor (#123081)
# Motivation
Support `NamedTensor` on XPU backend.

# Motivation
No need UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123081
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/ezyang
2024-04-09 18:45:17 +00:00
56dd7603da Cleanup comment (#123467)
Realized that these comments are not correct anymore. Updated to match what the code does
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123467
Approved by: https://github.com/mikaylagawarecki
2024-04-09 18:00:20 +00:00
7a78534468 [Compile FSDP2][1/n] Support using user-defined object instance method as hook (#123399)
FSDP2 has this pattern of using user-defined object instance method as hook, and it will throw this error under compile:
`torch._dynamo.exc.Unsupported: call_function UserDefinedObjectVariable(_pre_forward) [FSDPManagedNNModuleVariable(), TupleVariable(), ConstDictVariable()] {}`

This PR adds support for it by always allowing to trace into a UserDefinedObjectVariable that's an instance method (i.e. `MethodType`).

Supersedes https://github.com/pytorch/pytorch/pull/123320.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123399
Approved by: https://github.com/jansel
2024-04-09 17:29:08 +00:00
9a661636e3 Make lint clean on OS X (#123052)
I don't know why I get different mypy problems when I run on my Macbook,
but they weren't too hard to fix so I justed fixed them.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123052
Approved by: https://github.com/tugsbayasgalan, https://github.com/cyyever, https://github.com/albanD
2024-04-09 17:10:16 +00:00
46903d978b fix maybe_initialize_device for custom device. (#121379)
1. fix maybe_initialize_device for custom device.
@wanchaol  @albanD

@albanD  I am very sorry that I have resubmitted a PR by new e-mail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121379
Approved by: https://github.com/albanD
2024-04-09 16:58:52 +00:00
270dd99180 Fix record issue on XPUGuardImpl (#123523)
# Motivation
Previously,  `xpu_event` became a dangling pointer because the variable on the stack is destroyed when the scope ends. It results in these event-related functions (`destroyEvent`, `record`, `block`, and `queryEvent`)  used in `c10/core/impl/InlineEvent.h`, which serves `c10::Event`, do not work correctly.

# Solution
Use `new` allocated on the heap to assign `xpu_event` to avoid the dangling pointer.

# Additional Context
Add a UT to cover this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123523
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
2024-04-09 16:24:13 +00:00
266e278ccf UFMT formatting on test/distributions, test/error_messages, test/forward_backward_compatability (#123527)
Partiall addresses #123062

UFMT formatting on
- test/distributions
- test/error_messages, test/forward_backward_compatability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123527
Approved by: https://github.com/huydhn
2024-04-09 16:03:46 +00:00
c96bd3de06 Enable UFMT on all of test/fx (#123622)
Partially addresses #123062

Ran lintrunner on:

- `test/fx`

with command:

```bash
lintrunner -a --take UFMT --all-files
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123622
Approved by: https://github.com/ezyang
2024-04-09 15:59:17 +00:00
3b3962f7b3 Enable UFMT on torch_version.py and types.py (#123131)
Part of efforts described in #123062.

---

This PR enables the `µfmt` formatting for the following files:

- `torch_version.py`
- `types.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123131
Approved by: https://github.com/ezyang
2024-04-09 15:03:17 +00:00
2834c68deb Migrate linux-jammy-py3_10-clang15-asan-build to ARC (#123434)
Migrate linux-jammy-py3_10-clang15-asan-build to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123434
Approved by: https://github.com/zxiiro, https://github.com/atalman
2024-04-09 14:23:18 +00:00
6980c5048d UFMT formatting on test/mobile (#123521)
Partially addresses https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
test/mobile

Detail:
```Shell
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123521
Approved by: https://github.com/shink, https://github.com/ezyang
2024-04-09 14:06:22 +00:00
f61b04a1f0 [EZ] Update mypy to 1.9.0 (#123595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123595
Approved by: https://github.com/kit1980
2024-04-09 13:36:57 +00:00
491e2ed6d1 [AOTI] Fix an internal test regression (#123481)
Summary: https://github.com/pytorch/pytorch/issues/123174 causes some internal tests to fail, because when the generated model.so uses the MinimalArrayRefInterface, inputs are in ArrayRefTensor which still need to be converted using convert_arrayref_tensor_to_tensor. So let's bring back the relevant code with an enhanced way to detect numbers.

Test Plan: CI

Differential Revision: D55823570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123481
Approved by: https://github.com/chenyang78
2024-04-09 13:03:44 +00:00
a4a49f77b8 fix amp for AOTInductor (#122883)
## Pitch
This PR disables the amp when calling the inference_compiler in AOTInductor path (after having exported the model graph), following the way we disable AMP in Inductor path in https://github.com/pytorch/pytorch/pull/86515.

## Description
When testing AOTInductor AMP accuracy on CPU using the dynamo benchmark suites, multiple workloads will fail in this assertion: [assert pattern_repr not in _seen_patterns](1d52c2d985/torch/_inductor/pattern_matcher.py (L1095)) which is called when registering SDPA patterns.

The `inference_compiler` ([fw_compiler_base](1d52c2d985/torch/_inductor/compile_fx.py (L1234))) will call into [_recursive_joint_graph_passe](1d52c2d985/torch/_inductor/compile_fx.py (L1241)) and then [_sfdp_init](1d52c2d985/torch/_inductor/fx_passes/fuse_attention.py (L847)).

When testing accuracy, we'll set [inductor_config.fallback_random = True](1d52c2d985/benchmarks/dynamo/common.py (L3526)), which will make the `search_fn` to be `None` [here](1d52c2d985/torch/_inductor/fx_passes/serialized_patterns/central_index.py (L117-L118)), thus the pattern will be generated runtime [here](1d52c2d985/torch/_inductor/pattern_matcher.py (L1083-L1084)). When AMP is on, the generated pattern for SDPA FP32 will be the same as that of FP16, which makes the assertion fail.

Inductor path disables amp inside [aot_dispatch_base](1d52c2d985/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L124-L128)). We follow the same way to disable for AOTInductor path here (after having exported the model graph) to fix this issue.

## UT
For the added UT, there's one case
`python test/inductor/test_aot_inductor.py -k test_amp_fallback_random_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface` fails with the below error which is not caused by this PR itself. Marked it as skipped for now.
```
RuntimeError: Error in dlopen: /tmp/torchinductor_user/cf5vk3gqkbvud56qeotdxqvns4wbk3sjnlnuadolt7b6g7a6kspb/cfzjo5ackvrth2gp6oq4lfpdyfafoagodfpjvbzhsi2u64hza2vn.so: undefined symbol: _Z16aoti_torch_dtypeIN3c108BFloat16EEiv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122883
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-09 12:08:33 +00:00
4656ea5768 Migrate linux-jammy-py3_8-gcc11-pch to ARC (#123433)
Migrate linux-jammy-py3_8-gcc11-pch to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123433
Approved by: https://github.com/zxiiro, https://github.com/atalman
2024-04-09 10:51:45 +00:00
15745a52b0 Inductor: don't change the stride_order of FlexibleLayout if it's already the same as required (#122945)
## Pitch
Fixes https://github.com/pytorch/pytorch/issues/122489.
Don't change the `stride_order` of `FlexibleLayout` if it already has the stride with the order required.

## Description
For a layout that's both contiguous and channels last contiguous (for example `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` where the `C` dim is `1`), the behavior of calling  [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)) (where the order is specified as channels last) on it is different when it's a `FixedLayout` or a `FlexibleLayout`.

- For a `FixedLayout`, the size and stride is unchanged after the call: `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]`.
- For a `FlexibleLayout`, it will become `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`.

When weight is not prepacked (in dynamic shapes cases), the Conv extern kernel returns output in channels **first** for input with `size=[s0, 1, 28, 28]`, `stride=[784, 784, 28, 1]` but output in channels **last** for `size=[s0, 1, 28, 28]`, `stride=[784, 1, 28, 1])`.

In this PR, for a `FlexibleLayout`, we add a check to see if it already has the stride in the required order. If that's the case, we don't change its stride order when freezing it. This makes the behavior of  calling [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)) aligned for `FixedLayout` and `FlexibleLayout`.

## Additional context
For a `FixedLayout`, when calling  [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)), it will firstly run into [x.get_layout().is_stride_ordered(order)](069270db60/torch/_inductor/ir.py (L4067-L4070)) to check if it's already ordered as expected.

If it is a `FlexibleLayout`, when calling  [require_stride_order](069270db60/torch/_inductor/ir.py (L4053)),  it runs into [as_storage_and_layout](069270db60/torch/_inductor/ir.py (L4063-L4065)), which will always [freeze_layout_with_stride_order](069270db60/torch/_inductor/ir.py (L1805)) and will always call [as_stride_order](069270db60/torch/_inductor/ir.py (L2909)), without checking if the default stride of this `FlexibleLayout` (which has been realized before) is already as expected ([link](069270db60/torch/_inductor/ir.py (L2693-L2700))).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122945
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-09 10:00:30 +00:00
7c23fed12c Move step to cpu if state is already initialized (#123618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123618
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551, #123552
2024-04-09 09:04:18 +00:00
526a69f5ee Remove incorrect check (#123616)
Summary: This was a micro optimization that I thought would save time but it is not correct. For example, we cannot compare fake tensors.

Test Plan:
```
buck2 run 'fbcode//mode/opt' fbcode//langtech/edge/ns/tools/tests:test_ns_jit_traced_model_all_optimization_f328819347_portal_ns
```
now passes

Differential Revision: D55904083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123616
Approved by: https://github.com/aakhundov
2024-04-09 08:45:34 +00:00
d04957c0c6 Revert "Ignore logging.Logger.* calls during dynamo export (#123402)"
This reverts commit 75933ff5231b1caed333065ea9f5a847caa4cdaa.

Reverted https://github.com/pytorch/pytorch/pull/123402 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123402#issuecomment-2044236088))
2024-04-09 06:28:12 +00:00
b9d2b75bac Revert "Add test for skipping hf logging during export (#123410)"
This reverts commit ba55ef8e2165c718a269e5bca0cb83c635731c83.

Reverted https://github.com/pytorch/pytorch/pull/123410 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123402#issuecomment-2044236088))
2024-04-09 06:28:12 +00:00
666a628bea [Inductor pattern] support int8 woq mm pattern matcher with freezing passe (#122955)
There exist some issues in the previous PR (https://github.com/pytorch/pytorch/pull/120985) of supporting int8 WOQ mm pattern matcher. This PR tends to further optimize it.

1. New patterns are added to match int8 woq mm in gpt-fast model, due to different input layouts.

2. In constant folding, `int8_weight -> dq -> bf16_weight` should be kept for pattern match.

3. Currently, GPT-Fast enables `coordinate_descent_tuning` for CPU. This flag is only useful for CUDA, but it could change the graph: from the non-decomposed fallback pass to the decomposed one. We will disable the flag in GPT-Fast script for CPU, in order to have neat patterns. @yanbing-j

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122955
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-04-09 05:06:52 +00:00
d86cb9c747 [Quant][Inductor] Add qlinear_pointwise.binary op for X86Inductor backend (#123144)
**Note**: This is a reopen of https://github.com/pytorch/pytorch/pull/122288, which was merged by `ghstack land` to its base (not main) by mistake.

**Description**
Add qlinear_binary op for X86Inductor backend of quantization PT2E. It only supports `add` and `add_relu` now.
It will use post op sum if the extra input has the same dtype as output. Otherwise, it uses binary add.
```
+-------------------+--------------+---------------+
| Extra input dtype | Output dtype | Post op       |
+-------------------+--------------+---------------+
| Fp32/bf16         | fp32/bf16    | sum or add*   |
+-------------------+--------------+---------------+
| Fp32/bf16         | int8         | add           |
+-------------------+--------------+---------------+
| int8              | fp32/bf16    | not supported |
+-------------------+--------------+---------------+
| int8              | int8         | sum           |
+-------------------+--------------+---------------+
*Use sum if extra input and output have the same dtype; otherwise use add.
```

**Test plan**
python test_quantization.py -k test_qlinear_add_pt2e
python test_quantization.py -k test_qlinear_add_relu_pt2e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123144
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-04-09 04:56:37 +00:00
7283c37c98 [dynamo] Keep guards on global function (#123423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123423
Approved by: https://github.com/jansel
2024-04-09 04:23:11 +00:00
98cf183629 [merge rules] add mkldnn_lowerings.py to CPU inductor rule (#123627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123627
Approved by: https://github.com/kit1980
2024-04-09 04:07:20 +00:00
0d3a771f7b Allow for git worktree when computing clangtidy scm root (#123060)
When you make a git worktree, the .git "folder" in the worktree
is not a directory, it's a file pointing at the actual .git directory.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123060
Approved by: https://github.com/albanD
2024-04-09 03:49:27 +00:00
0fd072bf90 [inductor] easy: move mkldnn lowerings to its own file (#123556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123556
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-04-09 03:44:27 +00:00
f07c0977d5 [dynamo, 3.12] avoid using co_lnotab in symbolic_convert (#123577)
Accessing co_lnotab causes a deprecation warning to be issued, causing some dynamo-wrapped tests to fail. We do not need to remove co_lnotab from tests as of now, as they are still useful as an additional check for linetable correctness, but we will need to deal with co_lnotab removal by 3.14.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123577
Approved by: https://github.com/jansel
2024-04-09 03:40:05 +00:00
b63477faa2 [PT2][Inductor] Add decompose_mem_bound_mm to the customization pre and post grad passes (#123376)
Summary: Titled. It can give more flexibility to customize passes

Test Plan:
# unit test
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```

# local reproduce
### with decompose

```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split-decompose --model_type "cmf" --flow_id 540761965
```
optimus parameter sent to the scuba:
P1204802500
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLxXNRNO6q8ixo4CAIn5QalwADlsbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GN8MQhmQrZaii4sFAJ37FLW-yjkobr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKYr2xa3vOkEKIQDAL5eKqkDWQAebr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAjzORm9OYV951kBAF5WyqbckVY2br0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMpzQhbeucEI_BwDAOK0nUGoCsZkbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDJ2whaLisgDsYMDABd4ox_-2gp5br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJY4Pxkg0hntj9UCALgYP3xMdmMMbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO1gCxfWSaDFqhIBABzCPhU827F7br0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GPzyNBaxHtNFJdADADH7AsWMwixBbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBxaWAofHuojr0EBALKAINF-n_Ebbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GPTR_RZdqWhlGmwEADUfB1t_xKN-br0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 3615, 'pattern_matcher_count': 3231, 'normalization_pass': 825, 'remove_split_with_size_one_pass': 673, 'merge_splits_pass': 85, 'merge_getitem_cat_pass': 11, 'batch_aten_mul': 11, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'decompose_mm': 4, 'decompose_mmt': 4, 'batch_aten_sub': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_add': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'batch_relu': 1, 'batch_linear_post_grad': 1}), 'PreGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEp2SBdLi4Q9SfYCALG5AsLl-LJubr0LAAAz', 'BatchReLuPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMry_BbFwSc8epcBAP7-LFeL-aRbbr0LAAAz', 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKpamxaU2v4MlyANANGbWkDgUAQabr0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOMotxYcfE3jsWEBAFi0ABcmUboYbr0LAAAz', 'PostGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GC3yQRku3hY9VmkBAH3QvuAf5z8Cbr0LAAAz'}
```
### without decompose
optimus parameter sent to the scuba:
P1204807273
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLDLYxbo4HnP1ssDAKDGl5fN9SUnbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOKrQBnK6dfZg3YDALrJX7r23dN8br0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GER6ChcNzZ9NX94DAH6ZWJFFD5Uzbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNmRphbUGk2zvswAAJ3sOh3WWGBAbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDYJJBQRWfOYB0wFAJpCr7RsFnsQbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GM2ABxPOewvvdm8FAMPnyXSb6Fwzbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOkqSBYgyv9G4tQCAFBtGCq1OUhkbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMdbSBeSWQGyOGkDANtexORtG0lMbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GF8rvghuhGGVZXMBAPKAC7WPIeUGbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDWyCheBejGMvq0FAApYMMDOu7Jwbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDRgqxE_qCrmMyIDAL5TQ977TQknbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2323, 'pattern_matcher_count': 2071, 'normalization_pass': 825, 'remove_split_with_size_one_pass': 673, 'merge_splits_pass': 85, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'batch_aten_mul': 1, 'batch_relu': 1}), 'PreGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNyUxRYGchkL_gMLAKk5mC-cbU9zbr0LAAAz', 'BatchReLuPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GD19_BZMbHm46BMNAE05wMFtvB9mbr0LAAAz'}
```
# e2e
ads_dper3:f044a2eb340c5477d3347540114a83b0
training_platform:cb4809460a19df86785e790d3a9b92a6

### with decompose
use old config:
```
"decompose_mem_bound_mm": true
```
f549189471
{F1478026031}
add to the post grad fusion options:
```
"post_grad_fusion_options": {
            "batch_linear_post_grad": {},
            "decompose_mm_pass": {
              "min_first_dimension_decomposition": 10240,
              "max_other_dimention_decomposition": 32
            }
          },
```
f549189811
{F1478026133}
### without decompose
with optimize_compress off
f549190692
 {F1478027534}
with optimize_compress on
f549191653
### QPS and NE

 {F1481917745}

 {F1481917870}{F1481917871}

### conclusion
1. when compare  with no optimize_compress, has ~2% qps gain with NE neutral
2. when compare with  optimize_compress, qps is neutral but the optimize_compress shows NE gap compared to the baseline

Differential Revision: D55679277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123376
Approved by: https://github.com/jackiexu1992
2024-04-09 03:35:36 +00:00
8865425ff7 [minifier] Add config flag to ignore non-fp values (#123006)
When minifying, the after-aot minifier ignores non-floating values by
default but does check them when running the the initial graph dump step.
This means we may capture a graph that doesn't fail the tester and doesn't have
any meaningful divergence.

For example, the derivative of `elu(x)` depends on `x > 0` so this value is
saved for backwards and so becomes a graph output. However, the difference
between `FLT_MIN` and `0` in `x` is now enough to trigger an accuracy failure.

I fix this by adding a config variable and environment variable to ignore these
non floating point values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123006
Approved by: https://github.com/ezyang
ghstack dependencies: #123005
2024-04-09 03:34:09 +00:00
6d005ca590 [PT2][Observability] Add model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398)
Summary: We also log the model type and global rank for easier scuba query to develop the dashbord monitor. More context: https://docs.google.com/document/d/1RuUCOBOgVt9pp-Jgoo4oEXWvoYv6GN0DljypsqgVTp4/edit

Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542
```
optimus parameter sent to the scuba:
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO3rCxej_mk0RV0DAPE1wtdadgNkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GESWQhm_XNqiIYYCAJ2nCcg9PPwnbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAOv_RZb5hEwKIQBAPc7kNFDN2kEbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMUFOxkqRm1ellcDAFLjROHAy4NXbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAuEghZfCNtAVtcCACAqgBH3h4R0br0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GL_q2xZIJ9gRUp4GAAnBc-_frnUpbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GGIerBZMJvpn5moBAH4lzgkY5_Rjbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GKPngRabKVDgodEHAJNTi6H37kwbbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBuDPxmwQPFoGJkCAOsLt_QwVNxvbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJQypRaGi3AMr3MBAMWUDs5rHztkbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMce9xaOaCu3l9YCAM41j-H0hWZMbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2281, 'pattern_matcher_count': 2081, 'normalization_pass': 864, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1}), 'model_type': None, 'global_rank': None}
```

# e2e test
I have no resouce to run the test right now due to the MC proposal deadline. Will add it next week. Should ok based on the local reproduce results.

Differential Revision: D55777055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123398
Approved by: https://github.com/Yuzhen11
2024-04-09 03:28:10 +00:00
7420b8c5be [effects] Add way to register effectul op (#122348)
This adds a way to register an operator as being effectful. I also added a test case which mimics our solution for intermediate logging ([doc](https://docs.google.com/document/d/1eLyGDVe4iplVFiO0I021uLgA4Y6HxK9eqn55e9KzQkc/edit#heading=h.uwec2ukkwhea)), which is by creating a custom op and registering it as effectful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122348
Approved by: https://github.com/zou3519
ghstack dependencies: #122347
2024-04-09 03:22:32 +00:00
493478db4a [effects] Add inductor support for tokens (#122347)
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_;  l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```

AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo');  arg0_1 = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        return (getitem_2, add)
```
However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
   def forward(self, arg1_1: "f32[2, 3]"):
       _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
       with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo');  _make_dep_token_default = None
       getitem: "f32[0]" = with_effects[0];  with_effects = None
       add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
       with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
       getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
       _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,));  getitem_2 = None
       return (add,)
```
When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
2024-04-09 03:22:32 +00:00
9bd6d6e8b0 Add mem-eff-attention's sliding window arg to align with xformers (#123571)
# Summary
Updates to align with implementation in xformers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123571
Approved by: https://github.com/danthe3rd
2024-04-09 02:19:12 +00:00
565e8c0645 [Reland] Enable dynamo'd tests disabled for #115679 (#123552)
Relanding https://github.com/pytorch/pytorch/pull/123315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123552
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551
2024-04-09 02:14:32 +00:00
8bd6223730 [export] construct set_grad_enabled HOO subgraph inside other HOO subgraphs (#123391)
Summary:
Reference: https://github.com/pytorch/pytorch/pull/121736

Previously set_grad_enabled nodes in HOO subgraphs (e.g. cond) were inlined and not replaced with their own HOO subgraphs. This diff recursively does that.

Example:
```
class Model(torch.nn.Module):
    def forward(self, x, y):
        def true_fn(x, y):
            with torch.enable_grad():
                return x - y

        return torch.cond(
            x.sum() > 0,
            true_fn,
            lambda x, y: x + y,
            [x, y],
        )
```

Before (printing out `ep.graph_module.true_graph_0`):
```
        class <lambda>(torch.nn.Module):
            def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"):
                # No stacktrace found for following nodes
                _set_grad_enabled = torch._C._set_grad_enabled(True)
                sub: "i64[]" = torch.ops.aten.sub.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
                _set_grad_enabled_1 = torch._C._set_grad_enabled(False)
                return (sub,)
```

After:
```
        class GraphModule(torch.nn.Module):
            def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"):
                # No stacktrace found for following nodes
                submod_3 = self.submod_1
                sub: "i64[]" = torch._higher_order_ops.wrap.wrap_with_set_grad_enabled(True, submod_3, arg0_1, arg1_1);  submod_3 = arg0_1 = arg1_1 = None
                return (sub,)

            class GraphModule(torch.nn.Module):
                def forward(self, arg0_1: "i64[]", arg1_1: "i64[]"):
                    # No stacktrace found for following nodes
                    sub: "i64[]" = torch.ops.aten.sub.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
                    return sub
```

Differential Revision: D55770138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123391
Approved by: https://github.com/tugsbayasgalan
2024-04-09 02:08:03 +00:00
3969f85769 add TORCH_NCCL_HIGH_PRIORITY option (#122830)
There are many existing ProcessGroupNCCL features controlled by env vars.  This PR adds TORCH_NCCL_HIGH_PRIORITY to force the use of high-priority CUDA or HIP streams for the NCCL or RCCL kernels, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122830
Approved by: https://github.com/kwen2501
2024-04-09 01:11:18 +00:00
9faa8848ea [aotinductor] Add test case for outputs with views (#123415)
Also test views instead of .contiguous() for outputs with multiple aliases.

```
        output_handles[0] = buf0.release();
        output_handles[1] = output_handles[0];
        output_handles[2] = wrap_with_raii_handle_if_needed(tmp_tensor_handle_0).release();
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123415
Approved by: https://github.com/chenyang78
2024-04-08 23:56:01 +00:00
c3d37a88ed Fix a perf regression in MultiTensorApply (#123566)
#119764 inadvertantly increased the register usage of `multi_tensor_apply_for_fused_optimizer` and caused a perf regression. The increase was due to an unnecceary indirection from `multi_tensor_apply_kernel` to `multi_tensor_apply_kernel_dev`. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123566
Approved by: https://github.com/eqy, https://github.com/janeyx99
ghstack dependencies: #119764
2024-04-08 23:29:21 +00:00
ba55ef8e21 Add test for skipping hf logging during export (#123410)
https://github.com/pytorch/pytorch/pull/123402 already supports hf
logging because HF logger is based on logging module

This PR adds a test to guard this against regression, only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123410
Approved by: https://github.com/BowenBao, https://github.com/malfet
ghstack dependencies: #123402
2024-04-08 23:20:30 +00:00
1b9eebb6bb [AOTI] Handle null outputs (#123460)
Summary:

I skipped over the codegen for output handle assignment if the outputs
are null -- in addition to being redundant, it was causing compile
errors.

I also modified the runtime to do the necessary null checks.

Fixes #123173.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123460
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-08 23:07:03 +00:00
75933ff523 Ignore logging.Logger.* calls during dynamo export (#123402)
Follow up for https://github.com/pytorch/pytorch/pull/123368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123402
Approved by: https://github.com/williamwen42
2024-04-08 22:50:54 +00:00
aa9aed2fcf Removes forgotten print statement (#123579)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123579
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-04-08 22:37:19 +00:00
d8e0c26e64 [dynamo] Support warnings.catch_warnings (#123511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123511
Approved by: https://github.com/anijain2305
2024-04-08 22:27:46 +00:00
6951626735 [Reland] Enable tests disabled for #115607 (#123551)
Relanding https://github.com/pytorch/pytorch/pull/123314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123551
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497
2024-04-08 21:29:28 +00:00
4765570359 [onnx.export] Avoid building vals_to_params_map (#123025)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- Building vals_to_params_map costs linear time in N (number of nodes), when instead we can index into this dictionary directly.
- No need to call HasField on the final else, since c10::nullopt is the default returned value if a field does not exist.
- Resolves (3) in #121422.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123025
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-04-08 21:02:55 +00:00
c48f6680ff Skip test_artificial_grid_cpp_wrapper (#123211)
Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting cause a consistent failure. Disabling for now until we have time to investigate further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211
Approved by: https://github.com/desertfire
2024-04-08 20:55:47 +00:00
1be2126ff6 [pytree] Fix namedtuple serialization (#123388)
Summary:
Previously we were serializing namedtuple treespecs incorrectly:
```python
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)
flat, spec = pytree.tree_flatten(p)

print(flat)  # [1, 2]
print(spec)  # TreeSpec(type=namedtuple, context=Point, children=[*, *])

dumped_spec = pytree.treespec_dumps(spec)
print(dumped_spec)
"""
We only serialize the name of the class and the fields of the namedtuple:

TreeSpec {
  type='collections.namedtuple',
  context={class_name='Point', class_fields={'x', 'y'}},
  children=[Leaf, Leaf]
}
"""

reconstructed_spec = pytree.treespec_loads(dumped_spec)
print(reconstructed_spec)
"""
When we load, we create a new namedtuple class containing the same fields as before,
but the is class is now a completely different class than the original one:

TreeSpec(type=namedtuple, context=torch.utils._pytree.Point, children=[*, *])
"""

spec == reconstructed_spec  # False
```

So, we introduce a new API called `pytree._register_namedtuple` where users can pass in the serialized name for each namedtuple class:
```python
Point = namedtuple("Point", ["x", "y"])
pytree._register_namedtuple(Point, "Point")

p = Point(1, 2)
flat, spec = pytree.tree_flatten(p)

print(flat)  # [1, 2]
print(spec)  # TreeSpec(type=namedtuple, context=Point, children=[*, *])

dumped_spec = pytree.treespec_dumps(spec)
print(dumped_spec)
"""
TreeSpec {
  type='collections.namedtuple',
  context='Point',
  children=[Leaf, Leaf]
}
"""

reconstructed_spec = pytree.treespec_loads(dumped_spec)
print(reconstructed_spec)  # TreeSpec(type=namedtuple, context=Point, children=[*, *])

spec == reconstructed_spec  # True
```

Test Plan: `python test/test_pytree.py`

Differential Revision: D55771058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123388
Approved by: https://github.com/zou3519
2024-04-08 20:55:19 +00:00
c797fbc4e1 Enable UFMT on test/cpp_api_parity, test/cpp_extensions, test/create_dummy_torchscript_model.py, test/custom_backend, test/custom_operator (#123518)
Partially addresses #123062

Ran lintrunner on:

- `test/cpp_api_parity`
- `test/cpp_extensions`
- `test/create_dummy_torchscript_model.py`
- `test/custom_backend`
- `test/custom_operator`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123518
Approved by: https://github.com/huydhn
2024-04-08 20:18:42 +00:00
b279034e5a [DDP][PT2D] Add the trace rules for DDP (#121741)
Add the trace rules for DDP and refactor the tests to verify both DDP and replicate.

Differential Revision: [D54815909](https://our.internmc.facebook.com/intern/diff/D54815909/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121741
Approved by: https://github.com/yf225
ghstack dependencies: #123206, #123207
2024-04-08 19:53:13 +00:00
89e6292d48 Defer setting capturable in optimizer variable (#123497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496
2024-04-08 19:31:25 +00:00
73e235f0a6 Swap to ID guard for optimizer Variable (#123496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496
Approved by: https://github.com/anijain2305
2024-04-08 19:28:25 +00:00
6a3b47ec8f [PT2D][DDP] Remove the hack to pass None as the process group (#123207)
Functional collectives can now handle None as the process group.

Differential Revision: [D55658338](https://our.internmc.facebook.com/intern/diff/D55658338/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123207
Approved by: https://github.com/kwen2501
ghstack dependencies: #123206
2024-04-08 19:24:29 +00:00
aa73d5bb5c Register COLLECTIVE_COMM profiler activity type, if available (#121461)
**Summary:**
Instantiate a collective communications profiler. If a collectives profiler exists then add COLLECTIVE_COMM activity type to the CUDA activity types.

**Test Plan:**
Sample output trace (**internal** use only): https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/clientAPI/0/1707520266/devgpu007.pci1/nccl_activities_2643654_1707520266977.json.gz&bucket=gpu_traces
Co-authored-by: Darshan Sanghani <dsang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121461
Approved by: https://github.com/aaronenyeshi
2024-04-08 18:23:56 +00:00
a2327d203b [PT2D][DDP] Remove some hacks to get the test work (#123206)
It seems that these bugs are fixed (not sure what PRs) and we don't need to disable the buffer reused.

Differential Revision: [D55657388](https://our.internmc.facebook.com/intern/diff/D55657388/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123206
Approved by: https://github.com/kwen2501, https://github.com/yifuwang
2024-04-08 17:40:14 +00:00
3e8d3577be Revert "Swap to ID guard for optimizer Variable (#123496)"
This reverts commit 26bf05ccacc0377f0ef40d1d9c792c403267d5d5.

Reverted https://github.com/pytorch/pytorch/pull/123496 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per 26bf05ccac ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))
2024-04-08 17:06:05 +00:00
d9ac80f80c Revert "Defer setting capturable in optimizer variable (#123497)"
This reverts commit 76b290344f917ee0b9e1c69863ae04354a298dd2.

Reverted https://github.com/pytorch/pytorch/pull/123497 on behalf of https://github.com/PaliC due to seems to have broken distributed/fsdp/test_fsdp_hybrid_shard.py as per 26bf05ccac ([comment](https://github.com/pytorch/pytorch/pull/123496#issuecomment-2043251234))
2024-04-08 17:06:05 +00:00
e4e5449dfc [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-08 16:51:43 +00:00
61be8843c9 [TD] Use label to configure td on distributed for rollout (#122976)
Gate TD on distributed behind label

TODO:
auto add label to certain people's prs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122976
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-04-08 15:53:55 +00:00
4f66db80ca Migrate linux-jammy-py3.8-gcc11-no-ops to ARC (#123432)
Migrate linux-jammy-py3.8-gcc11-no-ops to ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123432
Approved by: https://github.com/zxiiro, https://github.com/malfet, https://github.com/atalman
2024-04-08 15:50:53 +00:00
3a7351bf91 [xla hash update] update the pinned xla hash (#123549)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123549
Approved by: https://github.com/pytorchbot
2024-04-08 11:27:18 +00:00
76b290344f Defer setting capturable in optimizer variable (#123497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123497
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496
2024-04-08 08:34:19 +00:00
26bf05ccac Swap to ID guard for optimizer Variable (#123496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123496
Approved by: https://github.com/anijain2305
2024-04-08 05:03:34 +00:00
bb04f3f66a [dynamo][logger] Log graph break on Unsupported bytecodes (#122684)
This would have saved me a few hours while debugging an internal model. We could not support a LOAD_ATTR bytecode, because it was a property, and the inlining failed due to skip. Since LOAD_ATTR does not support continuation function, we would fallback to eager for the whole frame aka skip. But, we should also log this as graph break. This PR does it.

Bonus - removes skip from a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122684
Approved by: https://github.com/ezyang
2024-04-08 01:50:04 +00:00
07cecf4168 [dynamo][cpp-guards] Fix bug for slices (#123516)
Automatic testing as soon as we turn on cpp guards by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123516
Approved by: https://github.com/jansel
ghstack dependencies: #123515
2024-04-07 21:09:05 +00:00
6ceec53579 [dynamo][cpp-guards] Fix test for CPP guard manager (#123515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123515
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-04-07 21:09:05 +00:00
212e460dce [dynamo] Support custom __setattr__ on UserDefinedObjectVariable (#123318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123318
Approved by: https://github.com/anijain2305
2024-04-07 21:06:52 +00:00
89724843bb Use graph.find_nodes in pattern matcher (#122331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122331
Approved by: https://github.com/jansel
ghstack dependencies: #121565, #122255, #122256, #122257, #122258
2024-04-07 18:51:22 +00:00
5aab2b9acf Use graph.find_nodes in functorch (#122258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122258
Approved by: https://github.com/jansel
ghstack dependencies: #121565, #122255, #122256, #122257
2024-04-07 18:51:22 +00:00
287680176b Use graph.find_nodes in dynamo (#122257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122257
Approved by: https://github.com/jansel
ghstack dependencies: #121565, #122255, #122256
2024-04-07 18:51:18 +00:00
f8465df9f0 Use graph.find_nodes in inductor (#122256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122256
Approved by: https://github.com/jansel
ghstack dependencies: #121565, #122255
2024-04-07 18:51:14 +00:00
33783e43e9 Use graph.find_nodes in inductor/fx_passes (#122255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122255
Approved by: https://github.com/jansel
ghstack dependencies: #121565
2024-04-07 18:51:09 +00:00
03b13851d9 [FX] Add side table to FX Graph for O(1) op/target query (#121565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121565
Approved by: https://github.com/jansel
2024-04-07 18:51:05 +00:00
0355f6e954 [Bug Fix] Fix Cuda 12.4 compilation - Refactor SFINAE boxing logic (#123377)
Summary:

PyTorch fails to compile from source using CUDA 12.4. The relevant log is extracted below. This was a recurring issue, which would cause the compilation to fail again on further objects if the first offending object was skipped.

While searching for whether others had experienced this issue before attempting a fix myself, I found this suggested fix by @christian-heusel in https://github.com/pytorch/pytorch/issues/122169#issuecomment-2008455468 written by @lahwaacz. The code written by @lahwaacz at bb1f1a4c54 fixes the issue.

The original issue (#122169) seems to have gone quiet, so I am submitting this PR. I made no substantive adjustments to @lahwaacz' code. My only adjustment was, for the sake of consistency, to remove the double underscores in the struct name, as double underscores are reserved to the implementation in C++ Standard. My change has no functional effect on the original code.

The ArchLinux package from which the original code was committed is licensed under the BSD license. https://archlinux.org/packages/extra/x86_64/python-pytorch/

```
[7900/8804] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o
/usr/bin/ccache /usr/local/cuda-12.4/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/elliot/compile_test-pytorch/build/aten/src -I/home/elliot/compile_test-pytorch/aten/src -I/home/elliot/compile_test-pytorch/build -I/home/elliot/compile_test-pytorch -I/home/elliot/compile_test-pytorch/cmake/../third_party/benchmark/include -I/home/elliot/compile_test-pytorch/third_party/onnx -I/home/elliot/compile_test-pytorch/build/third_party/onnx -I/home/elliot/compile_test-pytorch/third_party/foxi -I/home/elliot/compile_test-pytorch/build/third_party/foxi -I/home/elliot/compile_test-pytorch/aten/src/THC -I/home/elliot/compile_test-pytorch/aten/src/ATen/cuda -I/home/elliot/compile_test-pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/elliot/compile_test-pytorch/build/caffe2/aten/src -I/home/elliot/compile_test-pytorch/aten/src/ATen/.. -I/home/elliot/compile_test-pytorch/build/nccl/include -I/home/elliot/compile_test-pytorch/c10/cuda/../.. -I/home/elliot/compile_test-pytorch/c10/.. -I/home/elliot/compile_test-pytorch/third_party/tensorpipe -I/home/elliot/compile_test-pytorch/build/third_party/tensorpipe -I/home/elliot/compile_test-pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/elliot/compile_test-pytorch/torch/csrc/api -I/home/elliot/compile_test-pytorch/torch/csrc/api/include -isystem /home/elliot/compile_test-pytorch/build/third_party/gloo -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/gloo -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/elliot/compile_test-pytorch/third_party/protobuf/src -isystem /home/elliot/miniforge3/envs/torchtest/include -isystem /home/elliot/compile_test-pytorch/third_party/gemmlowp -isystem /home/elliot/compile_test-pytorch/third_party/neon2sse -isystem /home/elliot/compile_test-pytorch/third_party/XNNPACK/include -isystem /home/elliot/compile_test-pytorch/third_party/ittapi/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.4/include -isystem /home/elliot/compile_test-pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/elliot/compile_test-pytorch/third_party/ideep/include -isystem /home/elliot/compile_test-pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_86,code=sm_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized -Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o.d -x cu -c /home/elliot/compile_test-pytorch/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.o
/home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h: In static member function ‘static c10::detail::IListRefConstRef<at::OptionalTensorRef> c10::detail::IListRefTagImpl<c10::IListRefTag::Boxed, at::OptionalTensorRef>::iterator_get(const c10::List<std::optional<at::Tensor> >::const_iterator&)’:
/home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h:171:13: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  171 |     const auto& ivalue = (*it).get();
      |             ^~~~~~
/home/elliot/compile_test-pytorch/aten/src/ATen/core/IListRef_inl.h:171:33: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::get()’
  171 |     const auto& ivalue = (*it).get();
      |                      ~~~~~~~~~~~^~
/home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h: At global scope:
/home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h:42:103: error: expected primary-expression before ‘>’ token
   42 | struct has_ivalue_to<T, std::void_t<decltype(std::declval<IValue>().to<T>())>>
      |                                                                                                       ^
/home/elliot/compile_test-pytorch/aten/src/ATen/core/boxing/impl/boxing.h:42:106: error: expected primary-expression before ‘)’ token
   42 | struct has_ivalue_to<T, std::void_t<decltype(std::declval<IValue>().to<T>())>>
      |                                                                                                          ^
/home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h: In lambda function:
/home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:24: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                        ^~~~~~
/home/elliot/compile_test-pytorch/aten/src/ATen/core/dispatch/DispatchKeyExtractor.h:154:53: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const at::Tensor&, at::Tensor>()’
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                                                     ^
...

ninja: build stopped: subcommand failed.
```
```
PyTorch version: 2.4.0a0+git595613d
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 23.10 (x86_64)
GCC version: (Ubuntu 13.2.0-4ubuntu3) 13.2.0
Clang version: 16.0.6 (15)
CMake version: version 3.29.0
Libc version: glibc-2.38

Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-26-generic-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 550.67
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i7-13700K
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           1
CPU(s) scaling MHz:                 19%
CPU max MHz:                        5400.0000
CPU min MHz:                        800.0000
BogoMIPS:                           6835.20
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          640 KiB (16 instances)
L1i cache:                          768 KiB (16 instances)
L2 cache:                           24 MiB (10 instances)
L3 cache:                           30 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] optree==0.11.0
[pip3] pytorch-triton==3.0.0+989adb9a29
[pip3] torch==2.4.0a0+git595613d
[conda] magma-cuda124             2.6.1                         1    pytorch
[conda] mkl-include               2024.1.0              intel_691    intel
[conda] mkl-static                2024.1.0              intel_691    intel
[conda] numpy                     1.26.4          py311h64a7726_0    conda-forge
[conda] optree                    0.11.0          py311h9547e67_0    conda-forge
[conda] pytorch-triton            3.0.0+989adb9a29          pypi_0    pypi
[conda] torch                     2.4.0a0+git595613d          pypi_0    pypi
```

Tagging @colesbury per https://github.com/pytorch/pytorch/issues/122169#issuecomment-2008232619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123377
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-04-07 18:37:47 +00:00
1ea2f1eaa1 [BE][MPS] Reorganize logics and naming in copy.mm (#123310)
Was trying to address https://github.com/pytorch/pytorch/issues/119367 but hesitated to do so without knowing how the blit copy works under the hood. So I did some BE on naming and logics
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123310
Approved by: https://github.com/kulinseth
2024-04-07 07:14:02 +00:00
77681facac [fix] inductor split lowering fails if item() is captured (#123032)
Fixes #122937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123032
Approved by: https://github.com/jansel
2024-04-07 04:23:57 +00:00
e3ea316623 [dynamo] Save/restore cublas_allow_tf32 in convert_frame (#123509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123509
Approved by: https://github.com/anijain2305
2024-04-07 03:37:47 +00:00
eff1e4899c Add sparse COO/CSR/CSC/BSR/BSC meta tensor input support to torch.sum (#121673)
As in the title.

Fixes an issue reported in https://github.com/pytorch/pytorch/pull/117907#issuecomment-1987212514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121673
Approved by: https://github.com/cpuhrsch
2024-04-06 21:11:22 +00:00
7ce42ebd44 Generalise mod value ranges (#123253)
We also add the usual comment where we note that we don't handle
negative values in mod properly.

We should also fix this in the definition of ModularIndexing. I'll do that
in a later PR, as for that one I'll also need to fix a number of tests that
are testing an incorrect behaviour.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123253
Approved by: https://github.com/peterbell10
2024-04-06 20:19:24 +00:00
caed7f6727 profile pt2 compile time with strobelight (#123311)
For oss this diff adds a decorator @profile_sb_fbcode that is a nop for non meta workload.

Facebook:
With this diff someone can generate a strobelight profile for pt2 compilation.
users need to set the env variable TORCH_COMPILE_SL_PROFILE =TRUE .

For example:
```
TORCH_COMPILE_SL_PROFILE =TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profile_example
```
see sample output bellow, at the end of summary.

The way this works, is that a unique id is generated and associated with all samples that are collected for functions that are decorated with profile_sb_fbcode.
This id can then be used to combine different strobe light profile into one. (for example three compilation events happens in the code bellow).

Right now the following two functions are annotated with  profile_sb_fbcode.  bw_compiler and _compile. if two profile_sl_fbcode is called recursively, recursive invocations are ignored and a log is printed.

The output is:
```
Strobelight is enabled for pt2 compilation
Unique user-id for this run is: 2024-04-03-13:59:49147091devvm4561.ash0.facebook.com
You can use the following link to access the strobelight profile at the end of the run:
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22run_user%22%2C%22op%22%3A%22eq%22%2C%22value%22%3A[%22[%5C%222024-04-03-13:59:49147091devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&pool=uber&graphprofiler_filter=&graphprofiler_column_to_sort_by=exclusive
the link below takes you to the collected strobelight profile
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-6800545191281321%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181610%22%2C%22start%22%3A%221712174410%22%7D&view=GraphProfilerView&
1 storbelight success runs out of 1 non-ignored runs.
strobelight run id is: 6181728288420687
the link below takes you to the collected strobelight profile
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%226181728288420687%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181621%22%2C%22start%22%3A%221712174421%22%7D&view=GraphProfilerView&
2 storbelight success runs out of 2 non-ignored runs.
strobelight run id is: -1026103682715688
the link below takes you to the collected strobelight profile
https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22dimensions%22%3A%5B%5D%2C%22param_dimensions%22%3A%5B%7B%22anchor%22%3A%220%22%2C%22param%22%3A%220%22%2C%22op%22%3A%22edge%22%2C%22dim%22%3A%22py_async_stack%22%7D%5D%2C%22constraints%22%3A%5B%5B%7B%22value%22%3A%5B%22%5B%5C%22-1026103682715688%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_id%22%7D%2C%7B%22value%22%3A%5B%22%5B%5C%222024-04-03-13%3A59%3A49147091devvm4561.ash0.facebook.com%5C%22%5D%22%5D%2C%22op%22%3A%22eq%22%2C%22column%22%3A%22run_user%22%7D%5D%5D%2C%22top%22%3A10000%2C%22end%22%3A%221712181647%22%2C%22start%22%3A%221712174447%22%7D&view=GraphProfilerView&
3 storbelight success runs out of 3 non-ignored runs.
```

Test Plan:
Was tested on buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profile_example

This was also tested in one of the ads benchmarks
```
TORCH_COMPILE_SL_PROFILE =TRUE buck2 run mode/opt mode/inplace //pytorch/benchmark:run -- ads_mc_igctr_mc3_v0 -d cuda -t train --torchdynamo inductor
```
The results matches the results reported in
https://fb.workplace.com/groups/257735836456307/permalink/657458576484029

Differential Revision: D55672271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123311
Approved by: https://github.com/aorenste
2024-04-06 18:57:44 +00:00
c66d503194 Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)"
This reverts commit 6f7dd2f84a4237b31eac29054b86a5284ef6cb6b.

Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))
2024-04-06 16:19:00 +00:00
89cbb2d86d Allow docs build on workflow dispatch (#123493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123493
Approved by: https://github.com/kit1980
2024-04-06 14:42:39 +00:00
ecb2418dd6 Revert "Adding health check server hook in torch elastic (#122750)"
This reverts commit 61d431fab07f65d3e54c28f1ec420c517c7ada92.

Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))
2024-04-06 14:31:07 +00:00
7b02910163 [Compile FSDP2][2/n] Support streams created outside of compile region (#123487)
FSDP2 creates CUDA streams outside of compile region in its 1st iteration eager run, and then torch.compile will attempt to record method calls on these streams (e.g. `stream.record_event()`) in >1st iteration compiled run.

Before this PR, stream proxy is None which causes "None doesn't have attribute record_event" error when we try to call `record_event()` on it. After this PR, stream proxy has the correct value which makes calling methods on it possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123487
Approved by: https://github.com/jansel
2024-04-06 08:42:42 +00:00
6f7dd2f84a [Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces

Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55087993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
2024-04-06 06:04:28 +00:00
a4ef9cdd28 benchmark: raise tolerance to unblock triton upgrade (#123484)
Debugging is happening in https://github.com/pytorch/pytorch/issues/123126 .

Upgrading triton cause accuracy failure for mixer_b16_224  and levit_128 .

mixer_b16_224 is debugged specifically. It due to extra FMA instructions being used in a single kernel. That kernel itself only introduce small numerical difference. We conclude that this is not some 'real' accuracy issue and we should raise the tolerance to unblock the triton pin update.

The tolerance is picked such that the CI accuracy test can pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123484
Approved by: https://github.com/jansel
2024-04-06 03:43:25 +00:00
77643ed2eb [torch quantization]raise exception when OOM during combine histogram in observer (#123309)
Summary:
Even with changes in D55347133, it is still possible to OOM in histogram observer, because the size of allocated tensor also depends on *downsample_rate*.

For example, I still see OOM due to the attempt of allocating a 10GB+ histogram tensor in multi-task model.

To fix OOM issue better, we use *try-catch* clause to avoid OOM.
Empirically, we set the max size of a single histogram tensor size to 1 GB.

Test Plan: Test the change for Multi-Task model (depth + segmentation)

Differential Revision: D55567292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123309
Approved by: https://github.com/jerryzh168
2024-04-06 03:15:02 +00:00
d3596cf004 [dynamo][cpp-guards] Fix missing decref in GradGuardAccessor (#123488)
Found that there was a peak mem increase while running HF suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123488
Approved by: https://github.com/jansel
ghstack dependencies: #123485
2024-04-06 03:09:29 +00:00
6fa72480d3 Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459)
Summary: Now that we can input shapes as input args for RecordFunctionFast, let's add that to the triton heuristics. Also, lets add the ability to pass in a tuple into the RecordFunctionFast constructor.

Test Plan:
Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added tuple based unit test to profiler/test_profiler.py

Ran record_function_fast.py from the following branch
https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1

No shape or args: tests function fast with no args and profile without record_shapes
With shape tests: tests function fast with args and profile with record_shapes true
Args no shape: tests function fast with args inputted but record_shapes set to false
Args shape tuple: tests function fast with args inputted in form of tuple and record_shapes true

Stdout:

No shape or args:: 1.8491458892822266 us
With shape:: 2.211381196975708 us
Args no shape:: 1.9212646484375 us
With shape tuple:: 2.245788335800171 us

Differential Revision: D55809967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459
Approved by: https://github.com/davidberard98
2024-04-06 02:44:06 +00:00
8c84fe3c86 [dynamo][guards] Forward fix for #123302 (#123485)
For some reason, adding a `TYPE_CHECK` in DATA_PTR_MATCH guard in https://github.com/pytorch/pytorch/issues/123302 increases optimizer guard overhead for `MT5ForConditionalGeneration` by 10x. There is nothing special about MT5. As we are going to move towards the CPP guards soon, there is no reason to investigate this deeper.

We can use `ID_MATCH` instead of `DATA_PTR` match. Today both cant be serialized, so there is no one preference over the other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123485
Approved by: https://github.com/mlazos
2024-04-06 02:34:06 +00:00
841112d074 [dynamo, 3.12] fix graph break issues with BINARY/STORE_SLICE (#123401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123401
Approved by: https://github.com/jansel
ghstack dependencies: #123392
2024-04-06 02:19:15 +00:00
284b07ba63 [dynamo, 3.12] fix block stack related issues (#123392)
`JUMP_BACKWARD` in 3.12+ may not be in the exception table even though it should be considered a part of the block. Also fix a issue where we didn't propagate the exception table entry to new instructions when expanding the `POP_JUMP_IF_[NOT_]NONE` instruction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123392
Approved by: https://github.com/jansel
2024-04-06 02:19:15 +00:00
9189d04cb1 [inductor] Add explicit ops.fma and use it in softmax_backward (#122518)
This allows us to generate an fma even when fp-fusion is disabled
in the compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518
Approved by: https://github.com/lezcano, https://github.com/Chillee
2024-04-06 02:15:16 +00:00
3e8c64a637 [AOTInductor] Fix non-determinism in CUfunction declarations (#123266)
These use the ordering of sets and dictionaries to determine the output order,
which leads to run-to-run variance in the output code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123266
Approved by: https://github.com/desertfire
2024-04-06 02:08:17 +00:00
e94b81b254 Revert "Enable tests disabled for #115607 (#123314)"
This reverts commit 9564e204c1616ce78434abfdea0f3fd428b675f3.

Reverted https://github.com/pytorch/pytorch/pull/123314 on behalf of https://github.com/atalman due to  break TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123314#issuecomment-2040854499))
2024-04-06 01:59:22 +00:00
239abb2a14 add record function Id to Torch ops (#122948)
Fixes #122833
Add record function ID as additional metadata for PyTorch op events. This enables correlation with PyTorch Execution traces.
* Adds a new field "Record function id" for all PyTorch Op events. This value comes from `handle` in record function callback. This is a unique ID to correlate with the PyTorch Execution Trace.
* Updated unit tests.

## Test
Run a simple example uncommenting the `print trace` in the test below
```pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto```

We can see the new record function ID field in ET and Kineto  **Note: the name is "Record function id" now to match the other strings**
Kineto
![Screenshot 2024-03-28 at 5 48 55 PM](https://github.com/pytorch/pytorch/assets/6922212/08243698-8167-4ea0-9be6-2aede9fe9c43)
Execution Trace.
![Screenshot 2024-03-28 at 5 49 14 PM](https://github.com/pytorch/pytorch/assets/6922212/22e4e876-9fbe-43da-9150-dae2927b6e31)

We also see for cases where "External ID" is drifting but "Record function ID" is still matching.
Kineto
![Screenshot 2024-03-28 at 5 50 34 PM](https://github.com/pytorch/pytorch/assets/6922212/60905ea4-0da1-4c4b-a0d0-24500e8f7006)
Execution Trace
![Screenshot 2024-03-28 at 5 50 28 PM](https://github.com/pytorch/pytorch/assets/6922212/680db244-6725-48bf-a7ab-995c658a01ee)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122948
Approved by: https://github.com/davidberard98, https://github.com/shengfukevin
2024-04-06 01:35:03 +00:00
f4e2a226aa ScoreMod API (#121845)
# Summary

This PR adds a new higher-order_op: `templated_attention`.  This op is designed to extend the functionality of torch.nn.fucntional.scaled_dot_product_attention.  PyTorch has efficient pre-written fused-attention kernels. However, users want to modify how scores are computed (a substep inside attention) -- this traditionally requires the user to write their own attention kernel. One such modification to attention scores that is not currently supported by the top level SDPA op is:[ Attention with Linear Biases (ALiBi](https://arxiv.org/abs/2108.12409)).

This higher-order op will instead accept a callable( 'score_mod') function that is through torch.compile will be used to create an efficient attention kernel instantiation.

### Details

This HOP utilizes the existing fx and HOP infra to capture and convert the User `score-mod` function and convert to an FX graph module. Inductor then consumes this HOP that has a `ir.Subgraph` input. It will inline this lowered subgraph into a triton kernel which performs fused attention with the modification to the scores matrix inlined.

### API

The API for a score_mod function should be as follows:

```Python
def score_mod(score: torch.Tensor, batch: torch.Tensor, head: torch.Tensor, token_1: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

This function receives five parameters:

- `score`: A scalar tensor representing the attention score, with the same data type and device as the query, key, and value tensors.
- `batch`, `head`, `seq_len_q`, `seq_len_kv`: Scalar tensors indicating the batch index, head index, query index, and key/value index, respectively, with torch.int data type and located on the same device as the score tensor.

Consider inputs query, key, value of shapes (2, 4, 16, 8), leading to an intermediate attention score matrix of shape (2, 4, 16, 16)

The score_mod function will be vectorized over each element of this matrix. For instance, modifying the score at the position corresponding to the 0th batch, 2nd head, between the 8th query and the 9th key element, would be invoked as:

```Python
score_mod(score[0,2,8,9], torch.tensor(0), torch.tensor(2), torch.tensor(8), torch.tensor(9))
```

### Examples
```Python
import torch
from torch.nn.attention.templated_attention import templated_attention

torch.manual_seed(0)

# Lets create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim)
query = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
key = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)
value = torch.randn(8, 8, 2048, 64, device="cuda", dtype=torch.float32)

# Lets create a fun new score_modification! I will call this
# Checkerboard. It will reduce the score for neighboring tokens (1 step apart)
# in the sequence. And increase the score for tokens 2 steps apart. For everything
# else, the score will remain the same.

def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Lets call templated_attention with this new score modification
output = templated_attention(query, key, value, score_mod=checkerboard)

compiled_templated_attention = torch.compile(templated_attention)
out_compiled = compiled_templated_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

### Future Work
- This PR is currently only forward only. However the triton kernel for backwards where score_modifications to not rely on external buffers has been explored here: https://github.com/drisspg/transformer_nuggets/blob/main/transformer_nuggets/flash/flash_attention.py
- Kernel Improvements; There are has been some larger updates to the fused attention implementation that Triton uses in its tutorials. The implementation of this kernel is based on a prior version and should be updated.
- We may want to unify this API under the top level SDPA API and leave that as a follow up once this is more stable
- Should we error on CPU?
- There are some issues with dynamic shapes
- Capturing of free variables and lifting to inputs to the subgraph is not working correctly today

### Performance
Comparisons generated by this benchmark:

| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     5.412 |              |             |             |             |            |               |                |
| Max     |     8.882 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
| Min     |     3.645 |            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |
| Min     |     0.345 |            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |

For reference

| Configuration                                 | Forward Time (µ seconds) | Backend          | Speedup |
|-----------------------------------------------|--------------------------|------------------|---------|
| Fastest Config in Sweep (`8 16 4096 4096 64 relative_bias torch.bfloat16`) | 3608                   | Templated Attention                | 1.0  |
| Compiled SDPA (No Mask)                       | 9928                   | Math             | 2.75x   |
| Compiled SDPA (With Mask)                     | 11898                    | Math             | 3.29x   |
| Compiled SDPA (With Mask) | 8704                      | Memory Efficient Attention | 2.42x   |
| Compiled SDPA (No Mask) | 2548                     | FlashAttention2 | 0.706x   |

The speedups are measuring compiled templated attention speed versus different calls to torch.nn.functional.sdpa

<details>

<summary> FULL PERFORMANCE SWEEP NUMBERS </summary>

|   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |   eager_time |   compiled_time |   speedup |
|--------------|-------------|-------------|-------------|------------|---------------|----------------|--------------|-----------------|-----------|
|            1 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      331.444 |          67.221 |     4.931 |
|            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      335.300 |          64.187 |     5.224 |
|            1 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      352.039 |          63.806 |     5.517 |
|            1 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      371.699 |         711.349 |     0.523 |
|            1 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |      333.488 |          86.455 |     3.857 |
|            1 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |      322.363 |          82.469 |     3.909 |
|            1 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |      349.967 |          82.233 |     4.256 |
|            1 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |      486.359 |        1412.453 |     0.344 |
|            1 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |     2794.597 |         551.188 |     5.070 |
|            1 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |     3965.150 |         513.101 |     7.728 |
|            1 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |     2408.013 |         504.759 |     4.771 |
|            1 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |     6850.531 |       16733.675 |     0.409 |
|            8 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      441.939 |         123.576 |     3.576 |
|            8 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |      560.379 |         116.710 |     4.801 |
|            8 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      421.172 |         115.825 |     3.636 |
|            8 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |      994.492 |        2132.806 |     0.466 |
|            8 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     1436.430 |         309.495 |     4.641 |
|            8 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     1892.216 |         290.186 |     6.521 |
|            8 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     1360.665 |         282.956 |     4.809 |
|            8 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     3525.532 |        8359.702 |     0.422 |
|            8 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    22026.839 |        3864.604 |     5.700 |
|            8 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    31262.746 |        3609.551 |     8.661 |
|            8 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    20219.079 |        3480.402 |     5.809 |
|            8 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |    54654.647 |      116652.357 |     0.469 |
|           16 |          16 |         512 |         512 |         64 | causal_mask   | torch.bfloat16 |      820.606 |         188.683 |     4.349 |
|           16 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |     1058.362 |         179.295 |     5.903 |
|           16 |          16 |         512 |         512 |         64 | head_bias     | torch.bfloat16 |      784.372 |         175.714 |     4.464 |
|           16 |          16 |         512 |         512 |         64 | pathological  | torch.bfloat16 |     1890.792 |        4212.877 |     0.449 |
|           16 |          16 |        1024 |        1024 |         64 | causal_mask   | torch.bfloat16 |     2781.830 |         557.017 |     4.994 |
|           16 |          16 |        1024 |        1024 |         64 | relative_bias | torch.bfloat16 |     3694.050 |         525.249 |     7.033 |
|           16 |          16 |        1024 |        1024 |         64 | head_bias     | torch.bfloat16 |     2634.164 |         507.613 |     5.189 |
|           16 |          16 |        1024 |        1024 |         64 | pathological  | torch.bfloat16 |     6959.917 |       15331.116 |     0.454 |
|           16 |          16 |        4096 |        4096 |         64 | causal_mask   | torch.bfloat16 |    43889.096 |        7582.018 |     5.789 |
|           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |    62784.293 |        7075.846 |     8.873 |
|           16 |          16 |        4096 |        4096 |         64 | head_bias     | torch.bfloat16 |    40308.606 |        6829.587 |     5.902 |
|           16 |          16 |        4096 |        4096 |         64 | pathological  | torch.bfloat16 |   108892.137 |      233090.953 |     0.467 |
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121845
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-04-06 01:10:44 +00:00
8e98fda7a9 [dynamo][easy] Add AC test and improve graph break message (#121394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121394
Approved by: https://github.com/yanboliang
2024-04-06 01:02:45 +00:00
954d750516 Revert "Enable dynamo'd tests disabled for #115679 (#123315)"
This reverts commit d472ebf94a3f3a3dec31e9d8b2038127b2309727.

Reverted https://github.com/pytorch/pytorch/pull/123315 on behalf of https://github.com/atalman due to break TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123315#issuecomment-2040835229))
2024-04-06 00:57:42 +00:00
d9d25076fe Reduce guards of optimizer state dict to guard once per param group (#123413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123413
Approved by: https://github.com/anijain2305
2024-04-06 00:12:59 +00:00
f7e41a2b7a [pt2] Clean up for removing 2 decompose patterns (#123422)
Summary:
Follow up for D55759235.

should_decompose_mmt and should_decompose_mm_largek should be removed as well.

Test Plan: NA

Differential Revision: D55786581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123422
Approved by: https://github.com/jackiexu1992
2024-04-05 23:31:51 +00:00
d472ebf94a Enable dynamo'd tests disabled for #115679 (#123315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123315
Approved by: https://github.com/janeyx99
ghstack dependencies: #123313, #123314
2024-04-05 23:21:53 +00:00
9564e204c1 Enable tests disabled for #115607 (#123314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123314
Approved by: https://github.com/janeyx99
ghstack dependencies: #123313
2024-04-05 23:21:53 +00:00
61d431fab0 Adding health check server hook in torch elastic (#122750)
Summary:
Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled.

Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action.

Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test

Differential Revision: D55108182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
2024-04-05 23:17:30 +00:00
22b9987144 [dynamo][cpp-guards] ListGetItemGuardAccessor and TupleGetItemGuardAccessor (#123396)
Speeds up the guard-overhead microbenchmark by around 10% normalized to main-branch CPP guards

~~~
import torch

@torch.compile(backend="eager")
def fn(x, lst):
    for l in lst:
        x = x + l
    return x

n = 1000

lst = [i for i in range(n)]

x = torch.randn(4)
print(fn(x, lst))
print("Sucess")
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123396
Approved by: https://github.com/jansel
ghstack dependencies: #123285, #123302, #123303
2024-04-05 22:10:04 +00:00
cd6c58baea [custom_ops] mutated_args -> mutates_args (#123437)
This seemed better, since when you're construction a custom op you need
to provide "the args that the custom op mutates".

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123437
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109, #123110, #123129
2024-04-05 22:03:51 +00:00
81e7a7c955 Add mutated_args field to custom_op (#123129)
If provided, we:
- autogenerate an ADInplaceOrView implementation
- assume that no mutated inputs are returned as outputs. There are
  already aliasing runtime checks that check this.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123129
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109, #123110
2024-04-05 22:03:51 +00:00
9e8d2b6de2 Add register_autograd to register backward formulas for custom ops (#123110)
The user provides a `setup_context` and a `backward_function`. These
get put into a torch.autograd.Function that gets registered as the
custom op's autograd implementation.

Test Plan:
- we update custom ops in the custom_op_db to use the new
  register_autograd API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123110
Approved by: https://github.com/albanD
ghstack dependencies: #123108, #123109
2024-04-05 22:03:47 +00:00
d8e1c1087d Add is_tensorlist_like_type helper (#123109)
Checks if the type of an argument in a schema is some form of
TensorList.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123109
Approved by: https://github.com/albanD
ghstack dependencies: #123108
2024-04-05 22:03:42 +00:00
067851dd0d Expand is_functional_schema to work with torch._C._FunctionSchema (#123108)
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
2024-04-05 22:03:39 +00:00
42c2a5477c [export] nn_module_stack to return class name str (#123308)
Previously, `node.meta["nn_module_stack"]` had type `Dict[str, Tuple[str, class]]` when exported, and later `Dict[str, Tuple[str, str]]` after de/serialization. This PR changes it to consistently be `Dict[str, Tuple[str, str]]` for round-trippability, i.e.
```
{..., 'L__self___conv': ('conv', 'torch.nn.modules.conv.Conv2d')}
```

`source_fn_stack` is left untouched in this PR.

note: the `Union[type, str]` type annotations in ONNX are because ONNX goes through both `export.export()` and `_dynamo.export()` (which still has the original `Dict[str, Tuple[str, class]]` format). nn_module_stack from `export.export()` should consistently have the new format, and we verify/test for that in `_trace.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123308
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
2024-04-05 21:48:22 +00:00
63c221b7fa Clone mutated inputs in first pass of CPP wrapper compilation (#123316)
Summary: CPP wrapper compilation is currently done in two passes: in the first pass, Python wrapper is generated and run to compile Triton kernels as a side effect, in the second pass C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, although the first pass (including the Python wrapper run) is strictly a part of the compilation process, hence must not introduce any side effects on the example inputs.

In this PR, we clone mutated inputs in the first pass to avoid input mutation.

Fixes https://github.com/pytorch/pytorch/issues/117364.

Test Plan:

```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
2024-04-05 21:47:19 +00:00
4946558dd4 [minifier] Don't recompile for accuracy minification (#123005)
`backend_aot_accuracy_fails` reruns `compile_fx_inner` on the real inputs which
means the graph is recompiled with static shapes. This meant accuracy failures
related to dynamic shapes would never be captured by `REPRO_AFTER=aot`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123005
Approved by: https://github.com/ezyang
2024-04-05 21:22:57 +00:00
f5b8c9b730 Ignore some known duplicated modules in doc build config script (#123425)
This is a follow-up fix of https://github.com/pytorch/pytorch/pull/123244#discussion_r1552935150 as @clee2000 points out a better way to ignore those duplicated entries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123425
Approved by: https://github.com/clee2000
2024-04-05 21:12:14 +00:00
b0c86a5bc1 [PT] [ST] support non contiguous rank validation in sharded tensor (#123230)
Summary:
Previously the validation logic assumes the sharded tensors' global ranks range from `[0 .. WS]`

This is true if we do 1d flat sharding.
But once we get into 2d+, the ranks may not be contiguous any more.

e.g.
```
[0, 2]
[1, 3]
```
The group size is 2 but ranks may be >= 2.

Going forward, the ST will be replaced by DTensor so it's less of an issue but this is just to make it work for stacks still relying on ST (like torchrec).

Test Plan:
added UT
CI

Differential Revision: D55671872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123230
Approved by: https://github.com/kwen2501
2024-04-05 21:05:01 +00:00
d78991a738 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 20:58:16 +00:00
cbde0f048b [dynamo, 3.12] enable tests disabled due to missing dynamo 3.12 support (#123300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123300
Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/zou3519
2024-04-05 20:13:17 +00:00
ae6f8d923c Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
Summary:
Pass python c10d group_name to c++ ProcessGroupNCCL so that the pg name will be consistent across different layers.
Also record pg_name in flight recorder entry.

Differential Revision: D55597200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
2024-04-05 18:57:45 +00:00
d7f23f6826 [export] Restore original placeholder names (part 1: top-level renaming) (#122904)
Summary:
This PR restores original names to placeholder nodes, replacing the default names arg0_1, arg1_1, and so on.

User inputs now follow the signature of mod.forward(), for example forward(x, y) produces nodes x, y. If the tensors are nested in dictionaries, lists, tuples, or dataclasses, the names are a concatenation of the path to the tensor, e.g. x = {'a': torch.randn(4), 'b': [torch.randn(4), torch.randn(4)]} produces nodes x_a, x_b_0, x_b_1.

Parameters, buffers, constants, and custom objects follow the FQN of the object, prefixed by "p", "b", "c", and "obj" respectively. For example, self.bar.l0.weight gets you p_bar_l0_weight.
Effect tokens are named token_1, token_2, and so on, since they are not grounded in model inputs or named attributes.

note: breaking the original diff into 3 parts (top-level renaming, higher-order-op subgraphs, constant input de/serialization) because of its size.

Examples:
```python
# params, buffers, constants, inputs, torch.cond

ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_l0_weight: "f32[4, 4]", p_l0_bias: "f32[4]", c_alpha: "f32[4]", b_beta: "f32[4]", x_0_a: "f32[4, 4]", y: "f32[4, 4]"):
            # No stacktrace found for following nodes
            mul: "f32[4, 4]" = torch.ops.aten.mul.Tensor(x_0_a, x_0_a)
            t: "f32[4, 4]" = torch.ops.aten.t.default(p_l0_weight);  p_l0_weight = None
            addmm: "f32[4, 4]" = torch.ops.aten.addmm.default(p_l0_bias, y, t);  p_l0_bias = y = t = None
            return addmm

# model code

class Bar(torch.nn.Module):
    def forward(self, x):
        return x * x
class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bar = Bar()
        self.l0 = torch.nn.Linear(4, 4)
        self.alpha = torch.randn(4)
        self.register_buffer('beta', torch.randn(4))
    def forward(self, x, y):
        x = x[0]['a']
        mul = self.bar(x)
        z1 = self.l0(y)
        return z1

# custom objects, dataclasses, tokens, constant inputs

ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, token_1: "f32[0]", obj_attr, data_x: "f32[4, 4]", data_y: "f32[4, 4]", mode):
            # No stacktrace found for following nodes
            mul: "f32[4, 4]" = torch.ops.aten.mul.Scalar(data_x, 30);  data_x = None
            div: "f32[4, 4]" = torch.ops.aten.div.Tensor_mode(data_y, 1.0, rounding_mode = 'floor');  data_y = None
            add: "f32[4, 4]" = torch.ops.aten.add.Tensor(mul, div);  mul = div = None
            with_effects = torch._higher_order_ops.effects.with_effects(token_1, torch.ops._TorchScriptTesting.takes_foo.default, obj_attr, add);  token_1 = obj_attr = add = None
            getitem: "f32[0]" = with_effects[0]
            getitem_1: "f32[4, 4]" = with_effects[1];  with_effects = None
            return (getitem, getitem_1)

# model code

class Foo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.attr = torch.classes._TorchScriptTesting._Foo(10, 20)
    def forward(self, data, a=1.0, mode="floor"):
        x = self.attr.add_tensor(data.x) + torch.div(data.y, a, rounding_mode=mode)
        x = torch.ops._TorchScriptTesting.takes_foo(self.attr, x)
        return x

dataclass
class DataClass:
    x: Tensor
    y: Tensor
register_dataclass_as_pytree_node(
    DataClass,
    serialized_type_name="test.DataClass"
)

args = (DataClass(x=torch.randn(4, 4), y=torch.randn(4, 4)), )
kwargs = {'mode': 'floor'}
ep = torch.export.export(Foo(), args, kwargs, strict=False)

```

Test Plan: verification checks on placeholder names for all export() calls, unit test in test/export/test_export.py

Differential Revision: D55456418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122904
Approved by: https://github.com/angelayi, https://github.com/thiagocrepaldi
2024-04-05 18:56:00 +00:00
f71e368969 UFMT formatting on test/autograd test/ao test/cpp test/backends (#123369)
Partially addresses #123062

Ran lintrunner on
- test/_test_bazel.py
- test/ao
- test/autograd test/backends test/benchmark_uitls test/conftest.py test/bottleneck_test test/cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123369
Approved by: https://github.com/huydhn
2024-04-05 18:51:38 +00:00
de7edeea25 [DCP] DCP logger (#121352)
Adds additional logging for improved observability in DCP.

Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 17:50:50 +00:00
c8e117fb76 Tiny comments improvement (#123426)
Fixed a typo in `functional.py` and moved comment line to correct place in `transformer.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123426
Approved by: https://github.com/mikaylagawarecki
2024-04-05 17:25:42 +00:00
9b8f446e95 [pytorch profiler] Add metrics for performance timing and other statistics (#123412)
Measure the performance of various calls in PyTorch profiler and ave them to `_ProfilerStats` structure

Differential Revision: [D55457386](https://our.internmc.facebook.com/intern/diff/D55457386/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123412
Approved by: https://github.com/aaronenyeshi
2024-04-05 17:00:45 +00:00
32f9453c2a [dynamo] Emit FUNCTORCH_STACK_MATCH guard in vmap(compile(f)) case (#122786)
Fixes: #122201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122786
Approved by: https://github.com/zou3519
2024-04-05 15:04:16 +00:00
8c7d8f0ff2 Revert "Make torch_geometric models compatible with export (#123403)"
This reverts commit 2ffab6e663b9c6951048b8c8ba82d2cc5ca5c2fc.

Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))
2024-04-05 13:34:41 +00:00
5b0ce8f334 [Wheel] Change libtorch_cpu OpenMP search path (#123417)
To prevent delocate from double-packing it, which makes Torch wheels
unusable with torch.compile out of the box

Fixes https://github.com/pytorch/pytorch/issues/122705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123417
Approved by: https://github.com/atalman
2024-04-05 13:02:38 +00:00
cfd06bd60c [CI] Switched to the _linux-build-label workflow for pull, rocm, slow and trunk jobs (#123255)
Switched to the _linux-build-label workflow for pull requests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123255
Approved by: https://github.com/jeanschmidt, https://github.com/atalman
2024-04-05 09:34:30 +00:00
9743e3a19c [Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895)
As the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implemented Intel GPU Inductor backend by:
- Reuse WrapperCodegen and TritonScheduling for python wrapper and kernel code generation. And implenented device-specific code generation in XPUDeviceOpOverrides
- Reuse fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation.

For the test case, this PR provided test/inductor/test_xpu_basic.py for basic inductor backend functionality testing.
We'll reuse all the existing Inductor test case in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-04-05 09:05:11 +00:00
9078191666 [Inductor] Add the possible fusions group by priority (#123067)
**Summary**

Refactor the `Scheduler.fuse_nodes` changes in https://github.com/pytorch/pytorch/pull/121625. In the previous implementation of `Scheduler.fuse_nodes` in https://github.com/pytorch/pytorch/pull/121625, we use the `enable_outer_loop_fusion` context to ensure `OuterLoopFusion` happens after all the norm fusions.

And there is a discussion in https://github.com/pytorch/pytorch/pull/121625/files#r1527177141 to reuse current `score_fusion` mechanism. However, given that [fuse_nodes](f4ff063c33/torch/_inductor/scheduler.py (L1679-L1698)) will invoke `fuse_nodes_once` 10 times. We are concerned that the score approach may potentially disrupt pairs of regular fusion nodes in the 2rd invocation of `fuse_nodes_once` if they have been pick up by the outer loop fusion in the 1st invocation of `fuse_nodes_once`.

In this PR, we propose adding an abstract of `filter_possible_fusions_by_priority`. In each invoking of `fuse_nodes_once`, the possible fusions will be grouped by their priority from the backend. And only the group of possible fusions with highest priority will be fused in this invocation. In this way, we can ensure `OuterLoopFusion` happens after all the norm fusions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123067
Approved by: https://github.com/lezcano, https://github.com/jgong5
ghstack dependencies: #121625
2024-04-05 06:30:41 +00:00
bac2a39aee [Inductor] [ReImplement] Outer Loop Fusion for CPP Backend (#121625)
**Summary**
Re-implement of https://github.com/pytorch/pytorch/pull/121064

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_outer_loop_fusion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121625
Approved by: https://github.com/lezcano, https://github.com/jgong5
2024-04-05 06:24:57 +00:00
2ffab6e663 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 05:26:01 +00:00
18c9d46068 Fixes format utils executable (#123407)
Fixes an issue with the format utils executable, which was causing it to run as a no-op. :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123407
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 03:53:22 +00:00
7b575f0814 Handle transposes in second batch of matrices in bmm (#122194)
1. Add support for Unranked placeholders in the MPS backend.
2. PR is now fusing the Transposes into the GEMM kernel dispatches in MPS backend. This improves the performance of Transformer networks by 5-8%.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194
Approved by: https://github.com/DenisVieriu97, https://github.com/malfet
2024-04-05 03:40:16 +00:00
86c5cc6559 [ONNX][dynamo_export] Integrate onnx-rewriter optimizer (#123379)
Introduces common standard onnx optimization such as constant, if, controlflow folding and pattern rewrites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123379
Approved by: https://github.com/thiagocrepaldi, https://github.com/justinchuby
2024-04-05 03:29:40 +00:00
7ffad9ab04 Use out-of-place version of put inside take_backward (#123268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123268
Approved by: https://github.com/zou3519
ghstack dependencies: #122211, #122212, #122213
2024-04-05 03:29:11 +00:00
c575e378ba Update torch.compile_faq w.r.t to functorch (#122213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122213
Approved by: https://github.com/zou3519
ghstack dependencies: #122211, #122212
2024-04-05 03:29:11 +00:00
dbe0c474a9 Ensure all torch.func.* functions capture can be disabled (#122212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122212
Approved by: https://github.com/zou3519
ghstack dependencies: #122211
2024-04-05 03:29:11 +00:00
84658d9c4f Enable capture_func_transforms by default (#122211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211
Approved by: https://github.com/zou3519
2024-04-05 03:29:11 +00:00
3d20cc1332 Cleanup some duplicated placeholder py:module docs (#123244)
Fixes https://github.com/pytorch/pytorch/issues/123068
Fixes https://github.com/pytorch/pytorch/issues/111256

While investigating the flaky doc build failure .w.r.t duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discover an old but still open bug in Sphinx https://github.com/sphinx-doc/sphinx/issues/4459.  These warnings have always been there, but they are hidden because we are using `-j auto` to build docs with multiple threads.  It's just by chance that they start to surface now.

The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and run `make html` locally.  Then, these warnings shows up consistently.  As `make html` treats warnings as errors, they will fail the build.

```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```

The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-04-05 03:18:53 +00:00
16cb5d48dd Revert "[inductor] Add explicit ops.fma and use it in softmax_backward (#122518)"
This reverts commit 05984e642b16b289f0871d3db9d14426a57b76f0.

Reverted https://github.com/pytorch/pytorch/pull/122518 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing in trunk 05984e642b ([comment](https://github.com/pytorch/pytorch/pull/122518#issuecomment-2038631010))
2024-04-05 02:09:32 +00:00
535a84c125 Publish PyTorch docs to pytorch/cpp repo (#122895)
Updating the documents push to go to https://github.com/pytorch/docs repo instead of https://github.com/pytorch/pytorch.github.io as part of updating the PyTorch docs set up.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122895
Approved by: https://github.com/malfet
2024-04-05 02:00:20 +00:00
d7ccde58a7 [ONNX][dynamo_export] Fix ONNX export with print (#123368)
Partially fixes #123288

Doesn't handle the `method` case, but `print` is a start.

The approach is to mimic `torch.export.export` behavior, whatever that is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123368
Approved by: https://github.com/BowenBao
2024-04-05 01:07:46 +00:00
4cf5a9c505 [pt2] remove 2 decompose patterns (#123371)
Summary:
https://fb.workplace.com/groups/1075192433118967/permalink/1402410947063779/
some investigation. large-k pattern will degrade the perf. remove those patterns. Though mmt patten indeed shows some gain in compiling single operator P1201328502 and  P1201328722. but it will conflict with other opt in inductor and result in a slow down.

Test Plan:
some result from benchmark, manually to hold stride
```
import torch
import torch._inductor.config as inductor_config
import triton

inductor_config.trace.enabled = True

m1 = torch.rand(9388864, 2, device="cuda", dtype=torch.bfloat16)
m2 = torch.rand(9388864, 12, device="cuda", dtype=torch.bfloat16)
print(f"m1.stride {m1.stride()}")
print(f"m2.stride {m2.stride()}")

torch.compile
def fake_mm(a, b):
    return torch.sum(a[:, :, None] * b[:, None, :], dim=0)

tmp = fake_mm(m1, m2)
print(tmp.shape)
s = triton.testing.do_bench(lambda: fake_mm(m1, m2))
print(f"fake mm{s}")
tmp2 = torch.mm(m1.permute(1, 0), m2)
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0), m2))
print(print(f"mm{s}"))

m3 = m1.permute(1, 0).contiguous()
s = triton.testing.do_bench(lambda: torch.mm(m1.permute(1, 0).contiguous(), m2))
print(print(f"mm without permute{s}"))

result:
fake mm14.968459129333496
mm507.6383972167969
mm without permute0.7466956973075867
```

 single kernel can be speed up from 5ms->3ms
 {F1477685597}
 {F1477685813}

Differential Revision: D55759235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123371
Approved by: https://github.com/mengluy0125
2024-04-05 00:54:10 +00:00
0c8a165b43 [Export] Improve metadata and output parsing during deserialization (#122793)
Summary:
Deserialization of metadata could encounter a bug where commas are used in valid metadata names. This specifically occurs when a split of a `torch.nn.Sequential` stack is used, but may have other possible triggers. Because the deserialization relies on a comma based string split, such names trigger an error. This change uses a simple regular expression to ignore commas within parentheses to avoid the issue.

I add a test that constructs one such problematic sequential stack and show that it can be properly round-tripped with the improved splitting.

Similarly, deserialization could fail when outputs are not a tensor type. Although such outputs like None or constants are not very useful, they do show up in graphs and export should be able to support them. This change improves output node parsing and adds a corresponding test.

Test Plan: buck test //caffe2/test:test_export -- TestSerialize

Differential Revision: D55391674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122793
Approved by: https://github.com/zhxchen17
2024-04-05 00:25:37 +00:00
064a650b63 [AOTI][refactor] Improve generate_extern_kernel_out's signature (#123351)
Summary: Annotate types and make the names more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123351
Approved by: https://github.com/chenyang78
ghstack dependencies: #123346
2024-04-04 23:23:50 +00:00
aa063054ce [AOTI] Fix the codegen for aten.randint.low_out (#123346)
Summary: Fixing https://github.com/pytorch/pytorch/issues/123174. There are two problems here,
* Incorrectly calling convert_arrayref_tensor_to_tensor on int arguments. Removing relevant code since we don't use ArrayRef when there is a fallback op.
* codegen_kwargs generates an argument for the out parameter of ExternKernelOut. The fix is to leave that logic to corresponding wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123346
Approved by: https://github.com/chenyang78
2024-04-04 23:23:50 +00:00
e61d04e467 Revert "[sparse] Add fast semi-structured spasification kernels (#122350)"
This reverts commit c63a7b569133c9d91bde362c68e4f60abd4b619b.

Reverted https://github.com/pytorch/pytorch/pull/122350 on behalf of https://github.com/malfet due to This broke rocm builds, which is visible on PR as well ([comment](https://github.com/pytorch/pytorch/pull/122350#issuecomment-2038424125))
2024-04-04 23:15:36 +00:00
6107cbba1b doc: torch.nn.utils.rnn.pad_sequence: improve the example description (#123183)
doc: `torch.nn.utils.rnn.pad_sequence`: improve the example description.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123183
Approved by: https://github.com/mikaylagawarecki
2024-04-04 23:05:50 +00:00
19f8cf0167 [FSDP2] Used ReduceOp.AVG if bf16 reduce-scatter (#123362)
The motivation is similar to https://github.com/pytorch/pytorch/pull/120919/ -- we want to use only one mul/div kernel instead of one before and after gradient reduction if possible.

Because bf16 has the same dynamic range as fp32, the relative error is the same whether we pre and post-divide vs. just post-divide. In other words, the relative error does not depend on the magnitude.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123362
Approved by: https://github.com/kwen2501, https://github.com/wanchaol
ghstack dependencies: #122962, #123290
2024-04-04 23:01:18 +00:00
5494b2a8d3 enable test_sampled_addmm_zero_sized_cuda for rocm (#121940)
Enable test_sampled_addmm_zero_sized_cuda_* only for ROCm and CUDA issue is currently active.
Passed since ROCm 5.6

test_sampled_addmm_zero_sized_cuda_float32
test_sampled_addmm_zero_sized_cuda_float64
test_sampled_addmm_zero_sized_cuda_complex64
test_sampled_addmm_zero_sized_cuda_complex128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121940
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-04-04 22:38:29 +00:00
4b1b4db231 [export] Add stack_trace for non-strict export (#121034)
This addresses 2 issues with stack_trace metadata:
- stack_trace is currently missing from nodes in non-strict export
- in strict mode, stack_trace is populated for placeholder nodes, which may not be well-defined (with multiple uses)

We filter the call stack during tracing for calls from forward() methods, or ops in `torch.__init__.py` (e.g. sym_size_int, sym_constrain_range, etc.) to populate stack_trace. A node-level check is also added to _export_non_strict().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121034
Approved by: https://github.com/angelayi
2024-04-04 22:35:33 +00:00
512759a3d7 Fix for tensor attribute missing (#123313)
Tensors would sometimes be realized after we already registered attrs on the root nn module. This ensures all stack values are realized before registering attrs on the root nn module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123313
Approved by: https://github.com/anijain2305
2024-04-04 21:11:04 +00:00
30598c162d [pytorch][cuda] Optimized softmax forward native CUDA implementation (#122970)
In Triton's [softmax tutorial](https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html), native performance is significantly lower than Triton's. We accelerated the native code as follows:

--> Wrote a CUDA kernel `cunn_SoftMaxForwardSmem` for softmax forward that caches the inputs in shared memory. Currently the maximum usable shared memory is 48KB to preserve compatibility with older generation Kepler GPUs but we can increase this. This kernel uses vectorized loads and stores and runs on problem sizes that fit in shared memory and use aligned buffers.

--> Modified the default implementation's intra thread block reduction to use warp shuffles as the first step in reduction and use shared memory only to reduce across warps.

--> Simplified the `WriteFpropResults` code because the loop unrolling brought no benefits but had a potentially detrimental effect on register usage.

We can observe that there is still an advantage in the Triton implementation. We were able to recover the gap by using native `__expf` but we decided to leave `std::exp` to avoid affecting numerical stability.

```
Tests are ran on an A100 GPU using the benchmark in the Triton tutorial.

Before

softmax-performance:
          N       Triton  Torch (native)  Torch (jit)
0     256.0   336.946021      595.781814   241.830261
1     384.0   737.741110      762.046555   297.890900
2     512.0   884.128199      860.899863   362.829080
3     640.0   936.228605      901.458039   376.211253
4     768.0  1005.024893      973.306952   384.187594
..      ...          ...             ...          ...
93  12160.0  1336.034308      858.096595   330.642735
94  12288.0  1339.248830      837.047196   331.146707
95  12416.0  1338.877891      839.317673   329.113513
96  12544.0  1335.383669      835.342136   330.067106
97  12672.0  1339.402120      821.690012   329.854051

After

softmax-performance:
          N       Triton  Torch (native)  Torch (jit)
0     256.0   375.833684      602.629893   237.019883
1     384.0   312.572329      739.127852   301.777431
2     512.0   495.546303      863.736375   368.438508
3     640.0   520.953881      884.426455   369.633391
4     768.0   677.374681      975.722054   385.317013
..      ...          ...             ...          ...
93  12160.0  1337.253933     1300.589124   330.655916
94  12288.0  1336.333052     1188.412588   331.116192
95  12416.0  1337.610105     1209.703474   329.232825
96  12544.0  1338.723893     1232.849225   330.003484
97  12672.0  1340.232227     1236.057117   329.925347
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122970
Approved by: https://github.com/malfet
2024-04-04 21:05:23 +00:00
05984e642b [inductor] Add explicit ops.fma and use it in softmax_backward (#122518)
This allows us to generate an fma even when fp-fusion is disabled
in the compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #121924
2024-04-04 20:53:14 +00:00
d486cb7c1b Deprecate calling FakeTensor.data_ptr in eager-mode (#123292)
Today, we error out on FakeTensor.data_ptr under torch.compile. This PR
moves to error out on FakeTensor.data_ptr under eager mode to avoid
diverging behavior.

We do this by adding another bit onto FakeTensor that we'll remove after
the deprecation cycle.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282, #123291
2024-04-04 20:35:24 +00:00
fd60752786 Turn _allow_unsafe_data_ptr_access into a config option (#123291)
We're not planning on having this flag around for very long (see
deprecation in next PR), so it's better as a config option.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123291
Approved by: https://github.com/eellison
ghstack dependencies: #123261, #123282
2024-04-04 20:35:24 +00:00
de14717819 [triton] Backport https://github.com/openai/triton/pull/3433 (#122470)
Summary:
Pull cache API changes from https://github.com/openai/triton/pull/3433.

Among other simplifications, this allows us the cache all files in a "group"
atomically, in a single memcache blob, and avoid needing to use other
approaches to handle these files coming from different runs.

Reviewed By: bertmaher

Differential Revision: D55206000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122470
Approved by: https://github.com/bertmaher
2024-04-04 20:24:28 +00:00
5c7e2fd270 [dynamo, 3.12] use pymalloc allocator instead of malloc/free for frames (#123299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123299
Approved by: https://github.com/jansel
ghstack dependencies: #123216
2024-04-04 20:00:54 +00:00
d59c5d7353 [dynamo, 3.12] enable dynamo on 3.12, enable most dynamo unittests on 3.12 (#123216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123216
Approved by: https://github.com/jansel, https://github.com/malfet
2024-04-04 20:00:54 +00:00
c63a7b5691 [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-04 19:07:35 +00:00
d8717c2d68 Revert "Skip test_artificial_grid_cpp_wrapper (#123211)"
This reverts commit a8b9dcb9af8012fd64d310781c85a6e3055e67cc.

Reverted https://github.com/pytorch/pytorch/pull/123211 on behalf of https://github.com/clee2000 due to  test_artificial_zgrid  is failing internally and the PR to skip #123211 is also failing but for a different reason ([comment](https://github.com/pytorch/pytorch/pull/123211#issuecomment-2037979882))
2024-04-04 18:58:55 +00:00
a808559fc6 Revert "[inductor] Fix fresh_inductor_cache() (#122661)"
This reverts commit ba7d396eb73e91c1846ed770f470245ef578a923.

Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/clee2000 due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2037977934))
2024-04-04 18:55:55 +00:00
721dcaff94 Revert usage of NJT views in SDPA (#123215)
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.

**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
2024-04-04 18:45:47 +00:00
8b83327cd5 [FSDP] Fixed summon_full_params on submodule (#123290)
This PR fixes https://github.com/pytorch/pytorch/issues/122663.

This PR changes `_unshard_params` to directly look for FSDP modules instead of the two steps of first finding the root FSDP modules and then recursing on their submodules. This should address the issue where we call `summon_full_params` on an FSDP module that is _not_ the root FSDP module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123290
Approved by: https://github.com/weifengpy
ghstack dependencies: #122962
2024-04-04 18:26:08 +00:00
fe84155083 [TP][Tests] Replace assertEqual with deepcopy (#123218)
There were a lot of manual `assertEqual`'s in the tests to make sure `model_tp` was created the same as `model`.
`model_tp = copy.deepcopy(model)` should help us rest assured.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123218
Approved by: https://github.com/wanchaol
2024-04-04 18:11:58 +00:00
98e5238ad8 [codemod][lowrisk] Remove unused exception parameter from caffe2/caffe2/image/image_input_op.h (#123056)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D55548497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123056
Approved by: https://github.com/Skylion007
2024-04-04 17:24:43 +00:00
836a86064c Ensure torch.library doctests runs under xdoctest (#123282)
I'm not sure what "TORCH_DOCTEST_LIBRARY" is, but it prevented these
tests from running under xdoctest. This PR fixes the docstrings and
makes them actually run under xdoctest.

Test Plan:
- wait for CI
- I verified locally that the docstrings are now being tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123282
Approved by: https://github.com/williamwen42
ghstack dependencies: #123261
2024-04-04 16:20:42 +00:00
8f20cf1c71 Update the functionalization error message (#123261)
Previously, it suggested that a user add a manual functionalization
kernel. However, since we have auto_functionalize now, the user's first
course of action should be to modify their op into the form that
auto_functionalize accepts (this is possible in the majority of custom
ops).

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123261
Approved by: https://github.com/williamwen42
2024-04-04 16:20:42 +00:00
e0c9764660 Back out "Precompile triton templates (#121998)" (#123305)
Summary: We are reverting #121998 because the change plus search-autotune-cache led to significant compilation time increase, causing stuck job detector to trigger and then kill the training job.

Test Plan:
CI tests

Reviewed By: nmacchioni

Differential Revision: D55712203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123305
Approved by: https://github.com/eellison, https://github.com/nmacchioni, https://github.com/xw285cornell
2024-04-04 16:05:10 +00:00
595613d746 [CI] Workaround to the dind-rootless limitation to restore user on build.sh and test.sh (#122922)
Co-authored-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122922
Approved by: https://github.com/DanilBaibak
2024-04-04 14:20:51 +00:00
5ecfe58cfb Remove ulimit setting for ARC dind-rootless (#122629)
Since ARC runners use dind-rootless mode setting the ulimit in the docker run command is not possible as the dind-rootless container does not sufficient permissions to do that.

This change looks like it was coming from a migration from another CI system so perhaps it's not necessary anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122629
Approved by: https://github.com/jeanschmidt
2024-04-04 14:18:58 +00:00
26b4ccf9d1 Use numpy 2.0.0rc1 in CI (#123286)
Bump numpy version to 2.0.0rc1 in CI

Related to: https://github.com/pytorch/pytorch/issues/107302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123286
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/ZainRizvi
2024-04-04 14:00:19 +00:00
d9cbd57dfe Make u/int8 cat inductor fallback cpu-only (#123278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123278
Approved by: https://github.com/Chillee
2024-04-04 13:54:37 +00:00
b5488cbe64 Use Vectorized Half for eager and compile (#123260)
Using implementation added by https://github.com/pytorch/pytorch/pull/122918

Adapt `convert_half_float`/`convert_half_float` to work when better APIs are available
Fix `Vectorized<Half>::reciprocal()` and `::rsqrt()` implementation do use proper divisions rather than `vrecpeq_f16` which computes the estimate, rather than true division. Please note that this pattern already present in `Vectorized<Float>` for NEON, see:
05289a278c/aten/src/ATen/cpu/vec/vec256/vec256_float_neon.h (L618-L622)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123260
Approved by: https://github.com/mikekgfb
2024-04-04 13:28:44 +00:00
54801e6fd6 Revert "[Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122892)"
This reverts commit 0ba16ffd35af3eb56da4892cc5387c5e8ac864bb.

Reverted https://github.com/pytorch/pytorch/pull/122892 on behalf of https://github.com/atalman due to broke cuda tests ([comment](https://github.com/pytorch/pytorch/pull/122892#issuecomment-2037207036))
2024-04-04 13:22:22 +00:00
6890333e3d [inductor] fix tensor overlap detection that cause cudagraphs being disabled (#123327)
If any graph input has overlapping memory, inductor disables cudagraphs. But the function `complex_memory_overlap` detecting memory overlap can have false positive.

E.g. for tensor `rand_strided((8, 1500, 1), (1504, 1, 1), device=self.device)` the function reports overlapping previously.. This is caused by size=1 dimension. The fix is to do squeeze before running the detection algorithm.

This fixes the perf regress for hf_Whisper and timm_efficientdet when we do padding. For these models cudagraphs were dynamically disabled when doing padding due to the issue discussed here and cause perf regress.

This may help the dashboard if this is a common thing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123327
Approved by: https://github.com/Chillee
2024-04-04 08:55:00 +00:00
f00ece024b Handle wrong workflow name from GitHub (#123301)
Fixes https://github.com/pytorch/pytorch/issues/122422.  From my testing, the problem is that GitHub didn't return the correct workflow name in some cases and used the path to the workflow instead.

Take https://github.com/pytorch/pytorch/pull/123104 as an example, the returning name from GH graphql was `.github/workflows/generated-linux-binary-conda-nightly.yml` while the name we had on Rockset was `linux-binary-conda`.  The latter was correct, but the mismatch caused mergebot to miss the flaky failures.

This is a weird issue because retrying the graphql query eventually returns the correct name.

First query:
![Screenshot 2024-04-03 at 15 28 37](https://github.com/pytorch/pytorch/assets/475357/81a8ada4-c241-4e6b-b45d-7a6de1c3a151)

After several retries:
![Screenshot 2024-04-03 at 15 31 53](https://github.com/pytorch/pytorch/assets/475357/402c2e8c-f963-45f6-8c10-e1d2f49c5479)

Then I could never get the result like the first query again.

The fix here is to keep track of the job ID so that we can compare it instead of the `workflow / job` name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123301
Approved by: https://github.com/clee2000
2024-04-04 07:00:40 +00:00
dbeb214043 [aot_inductor] Fix issues in pre_grad passes (#123181)
Summary:

Fixed a bug in `sink_cat_after_pointwise` pass for PT IR. The root cause is asumption of existence of input in kwargs or args

Differential Revision: D55617545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123181
Approved by: https://github.com/hl475, https://github.com/khabinov
2024-04-04 05:13:28 +00:00
eee8413b8d [Inductor Intel GPU backend Upstream] Enable triton installation for Intel GPU (#122254)
Following the RFC https://github.com/pytorch/pytorch/issues/114856, Intel GPU Inductor backend depends on triton that  functions with Intel GPUs. This PR enabled the triton installation in Intel GPU CI docker build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122254
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #121883
2024-04-04 05:04:17 +00:00
2a24b54e65 [inductor] simplify expr when looking up size hint (#123140)
## Context

Suppose we have two symbols: `u0` and `s0` where we know that `u0 = s0`. Now, let's say we tried to look up the size hint for `u0 + 1`.
* Before this PR, we would use a fallback hint if one was provided.
3f6acf65fd/torch/_inductor/sizevars.py (L406-L407)

* With this PR, we would try to replace `u0` with `s0` via `simplify()` before using a fallback hint. 3f6acf65fd/torch/_inductor/sizevars.py (L46-L47)

## Concrete Example
A scenario where this is useful is when we're running autotuning benchmarking on bmm with two input nodes: one who has `s0` as the batch size and one who has `u0` as the batch size. During benchmarking, we'll create two example input tensors where the input with `u0` has to use a fallback hint for batch size. This will lead to a mismatch.

e3d80f2fa9/torch/_inductor/select_algorithm.py (L991-L997)

Using the fallback hint (i.e. 8192) leads to a batch size mismatch.

```
# Note: s0 = 7 and u0 = 7 and fallback hint is 8192.
LoweringException: ErrorFromChoice: Expected size for first two dimensions of batch2 tensor to be: [7, 30] but got: [8192, 30].
From choice ExternKernelCaller(extern_kernels.bmm)
```

Differential Revision: D55619331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123140
Approved by: https://github.com/aakhundov
2024-04-04 04:59:59 +00:00
5bc6bd3cb8 Remove excessive warnings and rewrite FSDP docstrings (#123281)
The page at https://pytorch.org/docs/stable/fsdp.html contains a series of warnings and notes that, due to their frequency, may detract from their intended purpose: to highlight crucial information. This PR aims to restructure these notes and warnings into a more coherent narrative, thereby enhancing the readability of the page.

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123281
Approved by: https://github.com/awgu
2024-04-04 04:08:57 +00:00
6694628170 [dynamo][guards] Remove workaround after #122858 (#123303)
Not needed since https://github.com/pytorch/pytorch/pull/122858 has landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123303
Approved by: https://github.com/mlazos
ghstack dependencies: #123285, #123302
2024-04-04 03:52:50 +00:00
5b45ec8892 [dynamo][guards] Use DATA_PTR instead of ID_MATCH for tensors (#123302)
We should sparingly use ID_MATCH guards. When it comes to performance, ID_MATCH is much faster DATA_PTR for Python guards. However, the difference is very small in C++. So, its worth just using DATA_PTR_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123302
Approved by: https://github.com/mlazos
ghstack dependencies: #123285
2024-04-04 03:52:50 +00:00
fb7664d5bf [dynamo][optimizer][guard-overhead] NOT_NONE guard for param.grad instead of TENSOR_MATCH (#123285)
For optimizers, we do an DATA_PTR match for parameters. For param.grad, we were doing TENSOR_MATCH, but what we really need to guard is if param.grad is None or not. Therefore, I add a new guard called NOT_NONE.

Further improves the guard overhead

![image](https://github.com/pytorch/pytorch/assets/13822661/574598ac-ca71-4e5e-9e75-8774577cd58f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123285
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-04-04 03:52:47 +00:00
63d17d3c90 Revert "Revert usage of NJT views in SDPA (#123215)"
This reverts commit 0fcddb56252c9b4401e8b888eddd4bc4bce3e624.

Reverted https://github.com/pytorch/pytorch/pull/123215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it needs to be skipped on ROCm 0fcddb5625 ([comment](https://github.com/pytorch/pytorch/pull/123215#issuecomment-2036080570))
2024-04-04 02:57:09 +00:00
ba7d396eb7 [inductor] Fix fresh_inductor_cache() (#122661)
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.

Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
2024-04-04 02:32:37 +00:00
1ea6d3a9b4 Fix conv decomp when running to core-aten (#123283)
Differential Revision: [D55709374](https://our.internmc.facebook.com/intern/diff/D55709374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123283
Approved by: https://github.com/angelayi
2024-04-04 01:14:09 +00:00
c58b0ac7c2 IntraNodeComm primitives for allgather_matmul (#118038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118038
Approved by: https://github.com/wanchaol
2024-04-04 00:46:08 +00:00
cyy
0ba16ffd35 [Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122892)
This PR continues to fix some clang-tidy warnings in distributed code, following #122884.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122892
Approved by: https://github.com/Skylion007
2024-04-04 00:39:31 +00:00
41e7619875 [EZ] Do not test for an undefined conversion behavior (#123258)
By limiting `VecConvertTests` subtest cases to positive numbers when converting to unsigned types.

As what `static_cast<unsigned int>(-3.0f)` is doing is compiler/architecture  specific, as one can observe by running
```cpp
#include <stdint.h>
#include <iostream>
unsigned int convert(float x) {
    return static_cast<unsigned int>(x);
}

int main(int argc, const char* argv[]) {
    auto inp = std::atof(argc > 1 ? argv[1] : "-3.0");
    std::cout << "cvt(" << inp << ")=" << convert(inp) << std::endl;
    return 0;
}
```
on x86 would print `cvt(-3)=4294967293`, but on ARM would convert to `0`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123258
Approved by: https://github.com/atalman
2024-04-03 23:42:25 +00:00
0fcddb5625 Revert usage of NJT views in SDPA (#123215)
For internal purposes, this PR reverts the use of real views in SDPA -> autograd.Function "views" (i.e. `ViewBufferFromNested` and `ViewNestedFromBuffer`). This is a temporary fix to get the FIRST model launched and working.

**Note: this breaks some other Dynamo tests related to SDPA that rely on real views, but the breakage there isn't expected to be likely in a real-world scenario.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123215
Approved by: https://github.com/YuqingJ
2024-04-03 23:25:31 +00:00
6e99f73923 [CMake] fix cmake regex to match newly introduced 9.0a architecture (#123243)
When people build pytorch extensions with cmake, and the GPU supports 9.0a arch as introduced in https://github.com/pytorch/pytorch/pull/110587 , the cmake regex is not updated to recognize the change, leading to cmake breaks like https://github.com/pytorch/pytorch/issues/113948 and https://github.com/pytorch/pytorch/issues/119946 .

This PR should fix them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123243
Approved by: https://github.com/malfet
2024-04-03 23:24:26 +00:00
05289a278c Fix for MPS regression in #122016 and #123178 (#123234)
Fixes #122016 and #123178. This regression is related to an OS side change that requires a slight adjustment from us on PyTorch side to restore the previous behavior. Additionally we cleared out pre-MacOS13 related workarounds.

Before the fix on MacOS 14.4:

```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```

After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```

This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-04-03 23:00:57 +00:00
4732375042 make RecordFunctionFast take inputs (#123208)
Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast.

Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter

Differential Revision: D55648870

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208
Approved by: https://github.com/davidberard98
2024-04-03 21:58:09 +00:00
a5cf9a5800 [CI] Do not install Nvidia drivers in ARC (#122890)
ARC Runners will provide working Nvidia drivers through the host configuration so this step is no longer necessary in the workflow as the ARC container is not able to install packages at the host level.

Also simplify the the setup-linux condition on if running in ARC as we can achieve the same result without needing an extra shell step via the hashFiles() function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122890
Approved by: https://github.com/seemethere, https://github.com/jeanschmidt
2024-04-03 21:47:16 +00:00
3e2b7e6052 [dynamo][guard overhead] Data ptr guard optimizer state tensors (#122858)
Stricter (but faster) guarding on optimizer state tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122858
Approved by: https://github.com/anijain2305
2024-04-03 21:42:06 +00:00
63a0ce89a0 [PT2][Inductor][3/n] Customize pre grad and post grad patterns (#121915)
Summary: Currently, we only enabled the group batch fusion customization, we also enable the split cat customization.

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542
```
P1196013839

Differential Revision: D54861682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121915
Approved by: https://github.com/jackiexu1992
2024-04-03 21:37:21 +00:00
7deb842b0d [MacOS] Default parallel jobs to performance cores (#123038)
By querying `sysctl hw.perflevel0.physicalcpu ` instead of
`std:🧵:hardware_concurrency()` which returns total number of
cores, which is sum of performance and efficient ones

As lots of parallel algorithm in ATen divide the the parallel task into an even region, this end up in faster code execution, compared to when all cores are used by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123038
Approved by: https://github.com/albanD
2024-04-03 21:05:17 +00:00
9e0838ff27 fix typo in export/test_export.py (#123228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123228
Approved by: https://github.com/pianpwk
2024-04-03 20:17:22 +00:00
620aaaf0cb [DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338)
[DCP] Adds ability to create a CPU state dict that is both shared and pinned, as well as a new utility specific to copying the state dict

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122338
Approved by: https://github.com/fegin
2024-04-03 20:05:01 +00:00
a4035bea5c [while_loop] support closures (#123018)
We add an additional_inputs arguments to the HOP while_loop and rename the operands to carried_inputs based on offline discussion with @zou3519 . This allows us to support closures, parameters and buffers.

The alternative is to pass the lifted inputs directly to outputs of body_fn. But since we want the body_fn's output to not aliasing input. We'll need to copy the inputs and remove the copies later. This is a bit more work to do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123018
Approved by: https://github.com/aakhundov
ghstack dependencies: #123217
2024-04-03 19:35:15 +00:00
5a66c2d65b Update pytorch/xla pin (#123217)
#123018 introduces a necessary bc breaking change and sees a bunch of xla test failures on CI. We made a pr to pytorch/xla to prepare for the breaking change https://github.com/pytorch/xla/pull/6872. We update the pin of pytorch/xla to reflect the change in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123217
Approved by: https://github.com/clee2000
2024-04-03 19:35:15 +00:00
9aed9c8c87 Reduce CPU overhead of copying inputs in CUDAGraph trees via foreach_copy (#123162)
I noticed that when enabling CUDA graphs in Inductor, most of the CPU time was spent issuing copies from the new inputs to the graph's input tensors. This meant that my workload was still somewhat CPU bound.
<img width="1204" alt="Screenshot 2024-03-28 at 14 18 49" src="https://github.com/pytorch/pytorch/assets/120810/9ac2462d-ef46-4051-8b22-e677845ca83e">

I tried to improve this situation by using the new `_foreach_copy_` operator, in order to group all the copies into one operator. There was already a comment in the code indicating that this was a desired optimization. It did indeed improve the situation substantially:
<img width="908" alt="Screenshot 2024-03-28 at 14 21 21" src="https://github.com/pytorch/pytorch/assets/120810/67548ac8-2b41-46ba-8588-cea6470301cc">

On device, the situation also improved, with the memcpys being merged into fewer larger kernels:
Before:
<img width="848" alt="Screenshot 2024-03-28 at 14 24 48" src="https://github.com/pytorch/pytorch/assets/120810/e12e27c4-6d86-40cf-9478-061bc10920d7">
After:
<img width="824" alt="Screenshot 2024-03-28 at 14 24 06" src="https://github.com/pytorch/pytorch/assets/120810/a4771b5c-6848-4510-a841-ffa5bba3023f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123162
Approved by: https://github.com/eellison
2024-04-03 19:34:41 +00:00
691054eeef Fix error message of autograd (#123154)
This PR updates the error message in autograd when an input tensor does not set to `require_grad`. The original message does not contain the index info, making users hard to debug.
The error message style consists with that on line 105-109.
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123154
Approved by: https://github.com/soulitzer
2024-04-03 19:07:21 +00:00
700917c361 Adjust logging content for TS usage logging (#123133)
Summary:
Remove unused/ignore/export TS logging because they do not represent independent TS usage and leads to overload of scribe

Log tupperware job's oncall information so that we have better attribution of who launched the job.

Test Plan: manual testing

Reviewed By: davidberard98

Differential Revision: D55610844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123133
Approved by: https://github.com/clee2000
2024-04-03 18:54:26 +00:00
9aa1a4d386 Remove mypy-ignore-errors from custom_op_db (#123107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123107
Approved by: https://github.com/soulitzer
ghstack dependencies: #122344
2024-04-03 18:36:17 +00:00
44c0c0fc0f Add torch.library.custom_op (#122344)
This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will
never peek into it) custom op. In this PR, you can specify backend impls
and the abstract impl for this op.

NB: most of this PR is docstrings, please don't be intimidated by the
line count.

There are a number of interesting features:
- we infer the schema from type hints. In a followup I add the ability
  to manually specify a schema.
- name inference. The user needs to manually specify an op name for now.
  In a followup we add the ability to automatically infer a name (this
  is a little tricky).
- custom_op registrations can override each other. This makes them
  more pleasant to work with in environments like colab.
- we require that the outputs of the custom_op do not alias any inputs
  or each other. We enforce this via a runtime check, but can relax this
  into an opcheck test if it really matters in the future.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-03 18:36:17 +00:00
aa16c0163f Only update momentum buffers for SGD if momentum is enabled (#122349)
As title

[benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370)

Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex.
ElectraForQuestionAnswering goes from 1090us -> 554us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349
Approved by: https://github.com/janeyx99
2024-04-03 18:29:55 +00:00
fe29a8fbea [quant][be] Simplify fake_quant_per_channel (#123186)
Summary: We probably don't need
`torch._C._AutoDispatchBelowAutograd()`, which is to prevent
infinite recursion if the implementation calls itself. Let's
remove it and see if anything breaks. The other major change
is registering the op to the more general Autograd dispatch
key so it can be used on cuda as well.

Test Plan:
python test/inductor/test_cpu_repro.py -k test_decomposed_fake_quant_per_channel

Reviewers: zou3519, bdhirsh

Subscribers: zou3519, bdhirsh, jerryzh168, leslie-fang-intel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123186
Approved by: https://github.com/zou3519, https://github.com/leslie-fang-intel
2024-04-03 18:06:45 +00:00
eqy
8b6b179a8a [cuBLAS][cuBLASLt][FP8] Enforce restrictions on amax computation for scaled fp8 gemms (#122821)
Word from `cuBLAS` is that `amax` computation is unsupported for non-fp8 outputs when the inputs are in fp8, even if it was "silently" executing in the past.

CC @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122821
Approved by: https://github.com/vkuzo
2024-04-03 17:54:36 +00:00
bde1a93bc4 Add lowering for resize, decomp for resize_as. (#122317)
This has been split off from #121354 as the inplace version of these
methods prove to be rather tricky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-04-03 17:47:29 +00:00
a8b9dcb9af Skip test_artificial_grid_cpp_wrapper (#123211)
Summary: This test is actually broken and probably succeeding by mistake because of a cache hit. Forcing a fresh cache or removing the errant setting cause a consistent failure. Disabling for now until we have time to investigate further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123211
Approved by: https://github.com/desertfire
2024-04-03 17:27:55 +00:00
8a0436014d Support map in pre-dispatch functionalization (#121444)
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
2024-04-03 17:14:41 +00:00
8ac0f072e6 [aot eager] Support frontend graphs with list arguments (#123212)
We already support bumpy inputs for 3rd party frontend and compiled backward graph, we should add the behavior to aot_eager too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123212
Approved by: https://github.com/jansel
ghstack dependencies: #122691, #122746, #123007
2024-04-03 17:07:52 +00:00
d895192e87 Fix zeros_like on sparse compressed fake tensors (#123084)
Fixes https://github.com/pytorch/pytorch/pull/117907#issuecomment-2025769663

Adds block compressed sparse tensors support to zeros_like

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123084
Approved by: https://github.com/amjames, https://github.com/peterbell10
2024-04-03 16:11:11 +00:00
74b3a7920e [Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**

 * During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**

Test plan:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
2024-04-03 13:34:16 +00:00
f2e67179ee [Inductor] Make codecache CUDA compilation more robust & flexible (#121490)
Minor changes which make the CUDA compilation within _inductor/codecache.py
more robust and flexible.

Test plan:
CI
Additional test in test_codecache.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490
Approved by: https://github.com/jansel
2024-04-03 12:56:48 +00:00
25ad90adc0 Revert "Support map in pre-dispatch functionalization (#121444)"
This reverts commit 9288b274611abc904a67d9cb02c837aa2cb769fd.

Reverted https://github.com/pytorch/pytorch/pull/121444 on behalf of https://github.com/atalman due to New test test_aot_export_predispatch_map_1 is failing on windows ([comment](https://github.com/pytorch/pytorch/pull/121444#issuecomment-2034526949))
2024-04-03 12:55:23 +00:00
957b8d5c00 [Inductor Intel GPU backend Upstream] Register general runtime device for Intel GPU (#121883)
Following the RFC https://github.com/pytorch/pytorch/issues/114856, Intel GPU Inductor backend uses device specific runtime API. To generalize this and reuse the existing generalize device interface, this PR registers the general device interface for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121883
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jansel
2024-04-03 08:34:05 +00:00
3eb84b6343 [dynamo][cpp-guards] Init LocalState only when TENSOR_MATCH guard present (#123152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123152
Approved by: https://github.com/jansel
2024-04-03 08:04:39 +00:00
68cffd19f6 DTensor: add ring attention for _scaled_dot_product_flash_attention (#122460)
Ring attention support for _scaled_dot_product_flash_attention with DTensor.

This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.

## Notable caveats
* Numerical accuracy: The backwards pass doesn't match numerically with the non-chunked version but the forwards pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)` and does an unnecessary accumulate which requires `Replicate()` on QKV when using `nn.MultiHeadedAttention` to work around the issue.
* If enabled, it forces sequence parallelism and doesn't interop with tensor parallelism.

## SDPA usage

```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    dquery = distribute_tensor(query, device_mesh, [Shard(2)])
    dkey = distribute_tensor(key, device_mesh, [Shard(2)])
    dvalue = distribute_tensor(value, device_mesh, [Shard(2)])

    dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
        dquery, dkey, dvalue, is_causal=is_causal
    )
    out = dout.to_local()
```

## Transformer usage

```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
    encoder_layer = nn.TransformerEncoderLayer(
        d_model=dim,
        nhead=nheads,
        dim_feedforward=dim,
        batch_first=True,
    ).to(dtype)
    encoder_layer = parallelize_module(
        module=encoder_layer,
        device_mesh=device_mesh,
        parallelize_plan={
            "self_attn": ContextParallel(),
        },
    )
    model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```

## Test plan

```
pytest test/distributed/_tensor/test_attention.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
2024-04-03 06:45:00 +00:00
f06d77caba [TP] Improve MLPStacked test (#123199)
Improve tests per @wanchaol 's suggestions in #122968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123199
Approved by: https://github.com/wanchaol
2024-04-03 06:14:49 +00:00
eb3a34d280 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-04-03 05:54:49 +00:00
d91db70295 [dynamo][cpp-guards] Optimize tensor.grad accessor (#123226)
For LayoutLM model, reduces C++ guard overhead by 1.48x. These are the numbers

![image](https://github.com/pytorch/pytorch/assets/13822661/25cfc35b-b67d-4903-8403-71fa931dacdd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123226
Approved by: https://github.com/jansel
2024-04-03 05:32:13 +00:00
9288b27461 Support map in pre-dispatch functionalization (#121444)
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
2024-04-03 03:28:14 +00:00
15bd81bfaf expose transformer header in cmake and wheel (#122586)
expose transformer header in cmake and wheel, some utils functions are used in nested transformer development on IPEX side
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122586
Approved by: https://github.com/drisspg, https://github.com/Neilblaze, https://github.com/gujinghui
2024-04-03 02:27:40 +00:00
102c676418 [DTensor] Added some more foreach ops (#123214)
These ops should already work with the existing strategy. We need these for precomputing fp32 -> fp8 casts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123214
Approved by: https://github.com/wz337
ghstack dependencies: #123142
2024-04-03 02:07:45 +00:00
15529de901 Remove FlameGraph comment from export_stacks (#123102)
Summary: In profiler.export_stacks there was a comment suggesting that the export was compatible with FlameGraph even though it isn't. We should remove this so that users are not confused.

Test Plan: Removed comment

Reviewed By: aaronenyeshi

Differential Revision: D55501792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123102
Approved by: https://github.com/aaronenyeshi
2024-04-03 01:09:13 +00:00
2964b1ef21 Extend XPU merge rules: Add torch/csrc/xpu/, torch/xpu/ and test/xpu (#122856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122856
Approved by: https://github.com/atalman, https://github.com/malfet
2024-04-03 00:52:08 +00:00
0c6e8af257 [AOTI][refactor] Update some test cases (#123093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123093
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2024-04-03 00:51:11 +00:00
0a2e0eb4c0 [functional collective rewrite] support rewriting reduce op for reduce_scatter_tensor (#122834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122834
Approved by: https://github.com/yf225
ghstack dependencies: #122666
2024-04-03 00:48:24 +00:00
f15fd650b7 [funcol] add deprecation warning for the legacy backend (#122666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122666
Approved by: https://github.com/yf225
2024-04-03 00:27:06 +00:00
31aff29b79 Add clone if output is a view from constant. (#123200)
Summary:
For the original clone we did for output, we only clone when the
corresponding tensor is an constant. We need this because we have to
make sure the constants' ownership maintain in the Model. However we
haven't include if it's a view of a constant.

Test Plan:
Included in commit
test_aot_inductor::test_return_view_constant

Reviewed By: frank-wei, desertfire

Differential Revision: D55645636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123200
Approved by: https://github.com/chenyang78
2024-04-03 00:13:39 +00:00
c77352b5cc Add torch._library.register_fake_class to fakify torchBind class (#122622)
This PR only adds abstract class registration logic without touching existing tests so they still trace with real script object. The added tests are only for registration APIs and test error messages.

Our design is that the abstract implementation should be in Python. This is much better in terms of usability. But this also has implications for custom op that takes script object as input, which is detailed later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122622
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620, #122621
2024-04-02 23:52:17 +00:00
46c7235406 add tensor queue example (#122621)
This PR adds a tensor queue example for later use. It doesn't touch any existing logic. It refactors the tests a little bit to avoid importing the library in unittest setUp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122621
Approved by: https://github.com/zou3519
ghstack dependencies: #122619, #122620
2024-04-02 23:52:17 +00:00
5d6a447357 [torchbind] change to parametrized tests for pre_dispatch (#122620)
Refactor the tests to make the test more robust.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122620
Approved by: https://github.com/zou3519
ghstack dependencies: #122619
2024-04-02 23:52:14 +00:00
071f23f4f3 [torchbind] redispatch call_torchbind in proxy dispatch mode (#122619)
This allows proxy_mode to further dispatch to fake tensor mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122619
Approved by: https://github.com/zou3519
2024-04-02 23:52:11 +00:00
e3d80f2fa9 [ONNX] beartype to emit warning instead of error by default (#123205)
Making exporter more "robust" to advances in beartype tool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123205
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-04-02 23:17:50 +00:00
b1aca36f4c [export] Allow legacy IR to be unflattened with weaker submodule ordering. (#123192)
Summary: In some cases we don't have information from the old IR about submodule ordering, in this case unflattener should still work in best effort mode.

Differential Revision: D55642005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123192
Approved by: https://github.com/angelayi
2024-04-02 23:08:55 +00:00
d7fe0603a1 Move sparse tests to TestOptimRenewed (#123146)
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse is actually just supports sparse grads...we don't test sparse params.

1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs False everytime) when sparse values are present.

2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for:

Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1}    <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s

OK
```

SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s

OK
```

SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s

OK
```

Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which will be complete from this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed.

Next steps:
Remove the hyperparameter constraints in common_optimizer.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality.

Add more tests!

Maybe add more optimizers!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
2024-04-02 22:51:02 +00:00
f2838c99a0 Add a tensor lr test for optimizers (#123139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123139
Approved by: https://github.com/albanD
ghstack dependencies: #123134
2024-04-02 22:51:02 +00:00
cb8fc30e4a Move LRScheduler integration tests to OptimizerInfo (#123134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123134
Approved by: https://github.com/albanD
2024-04-02 22:51:02 +00:00
12e36dc1df [dynamo] Fix torch._dynamo.disable on flatten_graph_inputs wrapper (#123007)
Existing `innermost_fn` handling of `functools.wraps` is not ideal, but I'm not sure if there's a good fix. This can manifest for GmWrapper (used to handle list inputs from Dynamo -> AOTAutograd) where we don't call the unflatten wrapper at runtime.

Since core parts of Dynamo rely on attribute check for `_torchdynamo_orig_callable`, so I'm adding a test to cover it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123007
Approved by: https://github.com/jansel
ghstack dependencies: #122691, #122746
2024-04-02 21:39:44 +00:00
71085983ae [c10d] [NCCL] Fix work handle for coalescing manager (#122849)
Fixes #122807
The work handle of the coalescing job will be populated:
```python
    with dist._coalescing_manager(group=pg_nccl, device=device, async_ops=True) as cm:
        dist.all_reduce(a)
        dist.all_reduce(b)
    print(len(cm.works)) # prints 1
    cm.wait() # actually waits
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122849
Approved by: https://github.com/kwen2501
2024-04-02 21:25:16 +00:00
5027ef7e9c [TP] Add wildcard support (#122968)
Adding wildcard support for TP's `parallelize_module` API.

Example patterns:
`layers.*.linear`: any characters
`layers.?.linear`: single character
`layers.[1-2]`: digit range, matches `layers.1` and `layers.2`

Example use case:
A model have multiple layers, and we want to parallelize the linear module `lin` inside each layer.
```
model_tp = parallelize_module(
    model,
    device_mesh,
    {
        "layers.*.lin": ColwiseParallel(),
    },
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122968
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wanchaol
ghstack dependencies: #122919
2024-04-02 21:23:39 +00:00
35f4d70240 [sparse] proper sparse iteration (#123128)
The branches were in the wrong order (since sparse tensors will also be instances of regular tensors). This puts the branches in the right order.

This is a small step towards #117188

@pearu to review (this was split of https://github.com/pytorch/pytorch/pull/117907)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123128
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-04-02 20:52:48 +00:00
19c2ed15c0 update submodule onnx==1.16.0 (#123125)
Fixes #121258

CC @malfet @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123125
Approved by: https://github.com/malfet
2024-04-02 20:41:22 +00:00
8244ee00cf Add fuzzer instructions to pt2 bug template (#123156)
Adds fuzzer instructions to our issue template
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123156
Approved by: https://github.com/eellison, https://github.com/anijain2305
2024-04-02 20:33:01 +00:00
0ff6155eee [AOTI] Support module buffer mutation (#123164)
Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164
Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov
2024-04-02 20:25:26 +00:00
a9a9ce6d9c [ez][FSDP2] Removed _contiguous_orig_stride (#123142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123142
Approved by: https://github.com/yifuwang
2024-04-02 20:18:27 +00:00
bcb6e5aa72 [DCP] Support partial load (#122829)
Adds ability to load a subset of keys directly from a checkpoint, avoiding the need to initialize state dict first

Differential Revision: [D55441391](https://our.internmc.facebook.com/intern/diff/D55441391/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122829
Approved by: https://github.com/fegin
2024-04-02 19:22:22 +00:00
feabb645a7 Revert "Handle transposes in second batch of matrices in bmm (#122194)"
This reverts commit 251ad1232b094d5ea0b641907e03bfd8a2011b61.

Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))
2024-04-02 18:49:28 +00:00
eqy
1c61401086 [cuBLAS] Fix typo in CUDA_VERSION ifdef for explicit workspace allocation (#123114)
12.2 is 12020, not 12200, oops...

CC @malfet @atalman @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123114
Approved by: https://github.com/peterbell10
2024-04-02 18:34:27 +00:00
64d743044d Add inline constraints to non-strict exported program (#123017)
Summary: This PR reduces the difference between strict and non-strict exported program by supporting inline_constraints for non-strict exported program,

Test Plan: CI

Differential Revision: D55547830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123017
Approved by: https://github.com/angelayi
2024-04-02 18:16:16 +00:00
d17eea9c0f [dynamo] fix broken 3.11+ windows build failure (#123104)
e.g. https://github.com/pytorch/pytorch/actions/runs/8478510063/job/23230951466#step:12:23296

Caused by https://github.com/pytorch/pytorch/pull/122335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123104
Approved by: https://github.com/atalman
2024-04-02 17:52:14 +00:00
251ad1232b Handle transposes in second batch of matrices in bmm (#122194)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194
Approved by: https://github.com/DenisVieriu97
2024-04-02 17:48:35 +00:00
aaef246c74 remove log2 decomposition; add log2 lowering (#123112)
Same reason as `log10`. `log2` is a core aten op, we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic; Furthermore, decomposing it will negatively impact the numerical precision of the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112
Approved by: https://github.com/peterbell10
2024-04-02 16:16:26 +00:00
5f46312dbb Reapply "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864) (#122713)
This reverts commit 92ed8553a65808682aeca59e3cb5823cf2d52839.

No longer importing codecache or boxed_nop at top level, both of which casued issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122713
Approved by: https://github.com/anijain2305
2024-04-02 16:11:00 +00:00
638b003cb7 [NJT] .to() properly updates device of offsets (#122797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122797
Approved by: https://github.com/jbschlosser
2024-04-02 16:07:27 +00:00
b27ee6548d Add a Dynamo deepdive to documentation (#122305)
This supersedes the previous `Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.

In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
2024-04-02 15:08:08 +00:00
c40f386afd [Inductor][1/n]Split cat customization (#123045)
Summary: Change the config and revise the group batch fusion in order not to reuse the exsiting pre_grad and post_grad fusion options

Test Plan:
# unit test
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923560510096
Network: Up: 15MiB  Down: 155MiB  (reSessionID-6a577a14-1772-42d9-9ae8-bfdc62f406a3)
Jobs completed: 267487. Time elapsed: 2:39.7s.
Cache hits: 99%. Commands: 104465 (cached: 104457, remote: 8, local: 0)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test @mode/dev-nosan //caffe2/test/inductor/fb:split_cat_fx_passes_fb
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199283031382
Network: Up: 28MiB  Down: 177MiB  (reSessionID-a3081518-7cba-4c83-b442-c16655ecb2cd)
Jobs completed: 183164. Time elapsed: 1:41.4s.
Cache hits: 99%. Commands: 75875 (cached: 75862, remote: 12, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099189612276
Network: Up: 1.3MiB           Down: 3.1MiB           (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7)
Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
Network: Up: 1.4MiB  Down: 3.2MiB  (reSessionID-0d312a2d-e19e-4ba6-9f96-7eb5863734e7)
Jobs completed: 68. Time elapsed: 2:19.9s.
Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:perf
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549804623287
Network: Up: 1.5MiB  Down: 1.1MiB  (reSessionID-8d912a20-fceb-4698-89c3-d28e0708831f)
Jobs completed: 164. Time elapsed: 1:42.2s.
Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12)
Tests finished: Pass 57. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
case 1: with split cat
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf" --flow_id 524546542
```
optimus parameter sent to the scuba:
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLL6RBZb-ssXJYcBAMzw0oaKtp80br0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GH1LAxcxv0Ae_BkFAHVav3K3oosDbr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GNb0jwR-Ukkqns4CAGRmOqucfedDbr0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHsIQxm-hn3SPrgCAKq1E-HBsoZHbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GOrJORmbMTV_xlQDAOwolqclPsIAbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCqkmRblvVKybGUDACVxkwVIrWxLbr0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCB1QBfko_kVN0wFAKGjSZv4DJULbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMwJPRmu4ry88swDAO1gdA5RCKIXbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLXCORnNiKeQFmoDABR93CRKmP8Sbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBMIPRnlwQyjSD4BANPuaMhV7MUjbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJ9KPxkOv4LL8_0DAA65D4kh4JYDbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 2844, 'pattern_matcher_count': 2604, 'normalization_pass': 886, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_aten_mul': 4, 'batch_sigmoid': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_add': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEcvPxmxBj-pd8gCABE1QgB-d6N6br0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEvQxhYomJGj2FMBAEXXAI8Vgzhmbr0LAAAz'}
```
P1202819405

case 2: without split cat
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch --model_type "cmf" --flow_id 524546542
```
optimus parameter sent to the scuba:
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAY7PxmGthuyjSwEAHF_A767YbMkbr0LAAAz', 'BatchLayernormFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLDPtBacXyybEOICAKaGCPatq5oabr0LAAAz', 'BatchSigmoidPreGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GBu7ORkiDJu42QAEAGmlVTgO_Mpbbr0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GC893BZNl99ftY4BAHm5Z8sM4ptSbr0LAAAz', 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GCAeuRYgzPO5RcsCAPO3Z7tdMNMKbr0LAAAz', 'BatchMulPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHBIQxm1jlU-xhsFAONkzhh2mgknbr0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDoUPhmZ0noiaGMDAJHYuuiwHEAUbr0LAAAz', 'inductor': Counter({'pattern_matcher_nodes': 1189, 'pattern_matcher_count': 757, 'batch_aten_mul': 9, 'batch_aten_sub': 3, 'batch_sigmoid': 2, 'batch_aten_add': 2, 'batch_layernorm': 1, 'batch_linear_post_grad': 1}), 'BatchAddPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GAluthYxi8uxpI4BAIQDzn3OyywUbr0LAAAz', 'BatchSubPostGradFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GDjsJhTK5VAcot4CADIcAixghrYibr0LAAAz', 'PostGradBatchLinearFusion': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GEPfJxfJwktC7wsEAA0QbkqYNuVAbr0LAAAz'}
```
P1202823734

# e2e
training_platform:fd4f02cd855f5cc0ccb49317a5a6c8bb

with split cat
f546646358

without split cat
f546647159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123045
Approved by: https://github.com/jackiexu1992
2024-04-02 14:36:22 +00:00
1f503dffb3 Revert "[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)"
This reverts commit 7eadb157bd96a9e641f64cdfa759aa1dfaaa7dd5.

Reverted https://github.com/pytorch/pytorch/pull/123136 on behalf of https://github.com/albanD due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/123136#issuecomment-2032163699))
2024-04-02 14:17:03 +00:00
72662bf05b [BE] Add torch.ops.aten._sparse_compressed_tensor_with_dims (#123083)
Used in https://github.com/pytorch/pytorch/pull/123084 and allows simplifying `empty_like` implementation for sparse compressed tensors (see https://github.com/pytorch/pytorch/pull/121900#issuecomment-2029835473).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123083
Approved by: https://github.com/cpuhrsch
2024-04-02 10:12:21 +00:00
f9b2ffa7c4 Forward fix lint after #119727 (#123137)
After #119727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123137
Approved by: https://github.com/albanD
2024-04-02 09:35:20 +00:00
7eadb157bd [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-02 09:00:05 +00:00
969bbf8e82 [dynamo][guards] Skip aliasing guards for optimizers (#123044)
I am ok if people don't want this PR to be merged.

For optimizers, we know that the state dict and param_group have same parameters. So, I think its ok to skip TENSOR_MUST_ALIAS guards.

Similarly for state tensors, all of them are different. Therefore, we can skip the tensor aliasing guards.

With this PR, these are the numbers for Megatron which has 394 parameters

<img width="290" alt="image" src="https://github.com/pytorch/pytorch/assets/13822661/0ce75dc6-4299-46bb-bf3c-7989ebc7cfc4">

C++ numbers jump a lot because of 2 reasons
1) We are now not doing INCREF/DECREF for a large number of tensors.
2) For python guards, we can expect higher numbers but that requires some more plumbing because the Python tensor guards are all collapsed into one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123044
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-04-02 08:51:00 +00:00
1d52c2d985 Add vec256_half_neon (#122918)
Summary: Add `vec256_half_neon.h` Currently not used anywhere.

Differential Revision: D55429392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122918
Approved by: https://github.com/mikekgfb
2024-04-02 04:02:57 +00:00
7a934e4031 [c10d] dump on any exception (timeout + nccl error) (#123023)
Summary:
Existing flight recorder dumping logic is: dump only on timeout, but not
on NCCL error. This resulted in the faulty ranks missing dumps when NCCL
error happens.

So in this PR, we revise the logic of dump such that records are dumped
when any exception is detected. Exception could be 1. NCCL async errors.
2. watchdog timeout

Also the existing code tends to mix the logic of flight recorder dump
and desync debug, which is no desirable. We only dump the desync debug
report only when timeout is detected.
Test Plan:
Added a new unit test to trigger nccl error and dump, and make sure the
dump is triggered by the error.

Also existing dump on timeout tests should still pass.

sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (84bf9d4c)]$ python
test/distributed/test_c10d_nccl.py NcclErrorDumpTest
NCCL version 2.19.3+cuda12.0
[E329 19:15:11.775879730 ProcessGroupNCCL.cpp:565] [Rank 0] Watchdog
caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=10, NumelOut=10, Timeout(ms)=10000) ran for
10028 milliseconds before timing out.
[E329 19:15:11.777459894 ProcessGroupNCCL.cpp:1561] [PG 0 Rank 0]
Exception hit in NCCL work: 2
[E329 19:15:12.660717323 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]
Received a timeout signal from this local rank and will start to dump
the debug info. Last enqueued NCCL work: 2, last completed NCCL work: 1.
[E329 19:15:12.660932242 ProcessGroupNCCL.cpp:1167] [PG 0 Rank 0]
ProcessGroupNCCL preparing to dump debug info.
[E329 19:15:12.661192990 ProcessGroupNCCL.cpp:1174] [PG 0 Rank 0]
ProcessGroupNCCL dumping nccl trace to /tmp/tmp06psqil3/trace_0
[F329 19:15:12.661485601 ProcessGroupNCCL.cpp:1185] [PG 0 Rank 0] [PG 0
Rank 0] ProcessGroupNCCL's watchdog detected a collective timeout from
the local rank. This is most likely caused by incorrect usages of
collectives, e.g., wrong sizes used across ranks, the order of
collectives is not same for all ranks or the scheduled collective, for
some reason, didn't run. Additionally, this can be caused by GIL
deadlock or other reasons such as network errors or bugs in the
communications library (e.g. NCCL), etc. We tried our best to dump the
debug info into the storage to help you debug the issue.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123023
Approved by: https://github.com/wconstab
2024-04-02 03:16:54 +00:00
f1c4d0fb2c [dynamo] show inlining reasons from trace_rules (#123014)
show specific inlining reasons with ``TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1``
* before, ``INLINING <code...>,  inlined according trace_rules.lookup``
* after, ``INLINING <code...> inlined according trace_rules.lookup MOD_INLINELIST``

this can distanguish between inlining by default or by MOD_INLINELIST (specific rule)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123014
Approved by: https://github.com/jansel
ghstack dependencies: #123013
2024-04-02 03:04:22 +00:00
0a038cf0cf [TP] Avoid splitting path twice (#122919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122919
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-04-02 02:06:11 +00:00
9d9d2af786 [BE] Move tests using functional API to OptimizerInfo (#122822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122822
Approved by: https://github.com/albanD
2024-04-02 01:35:59 +00:00
597f479643 Add torchbench on-demand test workflow (#122624)
When Torchbench discovers a regression, the PR author would like to know if their fix can pass the test, e.g., https://github.com/pytorch/pytorch/issues/122575

We are adding an on-demand ciflow to test Torchbench models if that is required by the user.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122624
Approved by: https://github.com/huydhn
2024-04-02 01:11:01 +00:00
fdc281f258 [inductor] lower min SM requirement for gemm autotuning to 68 (#123121)
Lower the minimum number of CUDA SMs required for GEMM autotuning from V100 to 3080 level, allowing some high-end consumer GPUs to benefit as well.

Fixes #109489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123121
Approved by: https://github.com/jansel
2024-04-02 00:28:59 +00:00
12ced0f986 make user defined triton kernel work with new ASTSource.make_ir API (#123124)
User defined triton kernel calls `ASTSource.make_ir`. Triton recently added an extra require argument to this API and make the call in PyTorch user defined triton kernel related code to fail. The PR make PyTorch work with both old and new version of the API.

Test:
```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_arg_abi_compatible_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123124
Approved by: https://github.com/oulgen, https://github.com/jansel
ghstack dependencies: #123076
2024-04-02 00:18:43 +00:00
bc65c98588 [AOTI] enabled a couple of tests for CPUs (#122992)
Looks like some tests already work for CPUs so we enable those.

Also added links to the relevant issues for the skipped cpu tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122992
Approved by: https://github.com/desertfire
2024-04-01 23:40:53 +00:00
09c72eaa3f [inductor] Remove identity from ops.scan (#119727)
Currently scan has an `init` argument which must be the identity of the
combine function. This isn't strictly necessary if we are more careful about
keeping track of the first element and avoid combining it with anything.

This does additionally require that there are no active load masks, since we can't
do the `where_cond` any more. However, this shouldn't be possible anyway since
scans are always realized and only fused via the scheduler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119727
Approved by: https://github.com/lezcano
2024-04-01 22:47:26 +00:00
4d5cdc2e1e Fix empty_like bug for sparse tensors. (#121900)
Fixes #121671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121900
Approved by: https://github.com/pearu
2024-04-01 22:40:38 +00:00
891994fd1b Update dynamo test failures list (#123111)
After #122728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123111
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2024-04-01 22:25:02 +00:00
489f4a063b Revert "Preserve unbacked SymInt on SymNode (#120816)" (#122988)
This reverts commit 476585b190b16f6b27369679f7e19df9e2d8f073.

I did a bisect and this seems to be the cause of compile time regression in cudagraphs_dynamic test suite between 03/23 and 03/24:
![image](https://github.com/pytorch/pytorch/assets/4063635/21394e06-4906-4690-b5a2-7d16cc475843)
image Particularly BERT_pytorch and hf_T5 seem to have ~50% compile time regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122988
Approved by: https://github.com/eellison
2024-04-01 22:11:09 +00:00
8b49782ba6 [Inductor] require channels last output for channels last input for max_pool2d_backward (#122749)
Previously we fell back on max_pool2d_with_indices_backward for channels last.. Turns out this was slow because we were inferring a contiguous output for channels last inputs. Fixing the layout and lowering gives a 1-2% TIMM win. It will also unblock saving the indices as int8 kernel offsets since we now lower channels last output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122749
Approved by: https://github.com/Chillee, https://github.com/amjames, https://github.com/jansel, https://github.com/shunting314
2024-04-01 22:02:00 +00:00
d765e223ac [dynamo][PT2D] avoid skipping dynamo_resume_* in torch/testing/_internal (#123013)
this PR ensures ``dynamo_resume_`` survives ``trace_rules.py``. As a ground truth, modules defined outside of ``pytorch/torch`` folders can survive ``trace_rules.py``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123013
Approved by: https://github.com/jansel
2024-04-01 21:12:48 +00:00
5d0ac887b9 [dynamo][higher order ops] Make the subgraph sourceless (#123071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123071
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #123046, #123058, #123059
2024-04-01 21:09:41 +00:00
69fa28f483 [dynamo][cpp-guards] Enable a few tests to prevent frequent regressions (#123059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123059
Approved by: https://github.com/jansel
ghstack dependencies: #123046, #123058
2024-04-01 21:09:41 +00:00
234287aa16 [dynamo][cpp-guards] DUAL_LEVEL guard (#123058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123058
Approved by: https://github.com/jansel
ghstack dependencies: #123046
2024-04-01 21:09:38 +00:00
ffd1e4e9ba [dynamo][cpp-guards] Always Reset relational guards (#123046)
Reset guard at the end of RootGuardManager, even if the result is true. Earlier we reset only when result was False. But this causes extra bookkeeping in each guard. This PR gives a tiny bit improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123046
Approved by: https://github.com/jansel
2024-04-01 21:09:35 +00:00
c4cbad4106 Fix broken test (#123034)
Summary:
Prior commit #122921 broke one unit test.
I renamed log->logger for consistency but forgot to make a similar
change in this one unit test.

Test Plan:
Test passes after fix
```
[cpio@devvm17556.vll0 /data/users/cpio/fbsource
(134660074|remote/fbcode/warm)]$ buck2 test '@fbcode//mode/opt'
fbcode//caffe2/test/distributed/elastic/multiprocessing:tail_log_test --
--exact 'caffe2/test/distributed/elastic/multiprocessing:tail_log_test -
test_tail_logfile_error_in_tail_fn (tail_log_test.TailLogTest)'
File changed:
fbcode//caffe2/test/distributed/elastic/multiprocessing/tail_log_test.py
Buck UI:
https://www.internalfb.com/buck2/19aeef9f-1d93-4505-975b-ecb205f3aad9
Test UI:
https://www.internalfb.com/intern/testinfra/testrun/5348024782558243
Network: Up: 11KiB  Down: 15KiB
(reSessionID-2b0989aa-3fe5-4e9a-943e-36625a0c4969)
Jobs completed: 7. Time elapsed: 8.0s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
[cpio@devvm17556.vll0 /data/users/cpio/fbsource
(134660074|remote/fbcode/warm)]$
```

Reviewers: wanchaol

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123034
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-04-01 20:47:47 +00:00
f461444be8 [inductor] make inductor work with new triton kernel launch API (#123076)
Triton changed its kernel launch API recently. Adapt inductor side call site to make it work with both old and new triton APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123076
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-04-01 20:32:52 +00:00
76a87e33a0 Remove cuda dependencies when building AOTriton (#122982)
Downloading CUDA sometimes fails and breaks the build process, but AOTriton does not need these packages for its own Triton fork. This commit comments out the related downloading scripts.

The actual changes from Triton can be found at: 9b73a543a5

Fixes the following building error
```
[2/6] cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
FAILED: CMakeFiles/aotriton_venv_triton /var/lib/jenkins/.local/lib/python3.8/site-packages/triton/_C/libtriton.so /var/lib/jenkins/workspace/build/aotriton/build/CMakeFiles/aotriton_venv_triton
cd /var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python && /opt/conda/envs/py_3.8/bin/cmake -E env VIRTUAL_ENV=/var/lib/jenkins/workspace/build/aotriton/build/venv PATH="/var/lib/jenkins/workspace/build/aotriton/build/venv/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.8/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" TRITON_BUILD_DIR=/var/lib/jenkins/workspace/build/aotriton/build/triton_build python setup.py develop
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-nvcc-12.1.105-0.tar.bz2 ...
downloading and extracting https://conda.anaconda.org/nvidia/label/cuda-12.1.1/linux-64/cuda-cuobjdump-12.1.111-0.tar.bz2 ...
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 325, in <module>
    download_and_copy(
  File "/var/lib/jenkins/workspace/build/aotriton/src/third_party/triton/python/setup.py", line 151, in download_and_copy
    ftpstream = urllib.request.urlopen(url)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 521, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 630, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 559, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/urllib/request.py", line 639, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524:
ninja: build stopped: subcommand failed.
```

Example of failed build log: https://github.com/pytorch/pytorch/actions/runs/8483953034/job/23245996425
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122982
Approved by: https://github.com/jansel
2024-04-01 17:50:35 +00:00
c422bce131 [codemod] Fix some namespace issues in caffe2 (#121847)
Summary:
Removes `using namespace` from a header file. Having `using namespace` in a header file is *always* a bad idea. A previous raft of diffs provided appropriate qualifications to everything that relied on this `using namespace`, so it is now safe to remove it in this separate diff.

Helps us enable `-Wheader-hygiene`.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54838298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121847
Approved by: https://github.com/Skylion007
2024-04-01 17:45:16 +00:00
533c1b6c49 Disable vulkan logsoftmax test (#123103)
Ex https://github.com/pytorch/pytorch/actions/runs/8509797936/job/23306567177

The failure was only surfaced after #122845 (the bug fix to surface cpp test failures) so I don't know when it started

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123103
Approved by: https://github.com/kit1980
2024-04-01 17:41:59 +00:00
d7a274e1b0 [dtensor] switch aten.t to use op strategy (#122950)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122950
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #122929, #122949
2024-04-01 17:39:43 +00:00
9e1447dad6 [dtensor] make sure expected input spec have correct tensor meta (#122949)
as titled, previously we could possibly return the expected input spec
that shared by multiple args, this is not ok since different args might
have different tensor metas, why it was working before is because
redistribute in these cases become a no-op.

This PR fixes it by making each expected input spec to shallow clone the
corresponding input metadata

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
2024-04-01 17:39:42 +00:00
afee5bea92 [dtensor] refactor schema suggestions in output sharding (#122929)
This PR refactors the schema_suggestions in OuputSharding to be a single
OpSchema instead of list of schemas, which in practice we only have one,
for the multiple resharding case we also moved to OpStrategy so there's
no case that needs it to be a list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
2024-04-01 17:39:39 +00:00
b4c810491e [export] Temporarily block mutating ops in quant tests. (#122863)
Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre dispatch export, so for now we just skip those mutating ops in tests.

Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization

Reviewed By: tugsbayasgalan

Differential Revision: D55442019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863
Approved by: https://github.com/clee2000
2024-04-01 16:41:13 +00:00
526ca5f28e [vec] fix compile warning in vec_n.h (#123090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123090
Approved by: https://github.com/lezcano
2024-04-01 15:55:27 +00:00
9ff2a9dcdd [dynamo] Skip leaf check on assert_metadata_eq if grad tensor level is -2 (#122728)
When fakifying a grad tracking tensor, if the level is -2 (sentinel
value) we can just unwrap the grad tensor and return a fake version of
it. In this PR, we update the `assert_metadata_eq` to not compare if
the grad tensor and the unwrapped ones are leafs or not, as this may
not be always true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122728
Approved by: https://github.com/zou3519
2024-04-01 15:38:16 +00:00
03439d4c1c [inductor] Lower divide by constant as multiplication by reciprocal (#121924)
Fixes #101039

This lowers division by a constant value to be multipication by reciprocal.
The same optimization is applied in eager mode on CUDA:

0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121924
Approved by: https://github.com/lezcano
2024-04-01 14:37:37 +00:00
6939279a17 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-04-01 14:30:44 +00:00
dd8a24b8b7 [xla hash update] update the pinned xla hash (#123078)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123078
Approved by: https://github.com/pytorchbot
2024-04-01 11:17:02 +00:00
4b725e1619 [AOTInductor] Support quantized linear on CPU with fbgemm (#123069)
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we
decompose it into two steps, pack weight, and fbgemm's qlinear with
packed weight.

Test Plan:
Included in commit.
test_aot_inductor::test_quantized_linear

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
2024-04-01 09:15:05 +00:00
6b1f13ea2f Add skip models by device in Dynamo Test (#122591)
Fix skip logic in `runner.py`. Add skip list which was defined by device for dynamo benchmark runner `runner.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122591
Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/jgong5
2024-04-01 03:16:32 +00:00
8b7da5b791 Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297)
For `at::scalar_tensor` the default dtype will be `float` ([link to scalar_tensor](0d8e960f74/aten/src/ATen/native/TensorFactories.cpp (L856)), [link to default dtype](0d8e960f74/c10/core/TensorOptions.h (L551))) if we don't set the `dtype` value. However, the input scalar value is not necessarily a `float` value. With `torch::tensor(x)`, the dtype of the tensor will be decided according to the dtype of the scalar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122297
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-01 01:32:41 +00:00
781e8d2201 [dynamo] Support __next__ on UserDefinedObjectVariable (#122565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122565
Approved by: https://github.com/yanboliang
2024-03-31 19:00:03 +00:00
5fc0f52bf0 [BE] Use modern C++ in ATen tests (#123031)
`std::is_same<A, B>::value` -> `std::is_same_v<A, B>`
`std::is_floating_point<T>::value` -> `std::is_floating_point_v<T>`
And use constexpr instead of defining two mutually exclusive templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123031
Approved by: https://github.com/Skylion007
2024-03-31 16:07:38 +00:00
fa6178d246 [CI] Updated expected result files after https://github.com/pytorch/pytorch/pull/122846 (#123035)
Summary: Before https://github.com/pytorch/pytorch/pull/122846, pyhpc_isoneutral_mixing in AOTI inference run segfaults so its result was not logged in the expected result file. Now it does show as fail_to_run instead of None.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123035
Approved by: https://github.com/chenyang78
2024-03-31 13:56:00 +00:00
6c2f36c984 Upgrade submodule pybind to 2.12.0 (#122899)
To fix https://github.com/pytorch/pytorch/issues/122056

Building with NP 2.0 allows me to run locally with both NP 2.0 and 1.26.
Any other test we should run @rgommers  ?

FYI @Skylion007 @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122899
Approved by: https://github.com/Skylion007
2024-03-31 11:29:40 +00:00
cyy
6d8bb0e984 [Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884)
This PR fixes some clang-tidy warnings in distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884
Approved by: https://github.com/kwen2501
2024-03-31 09:06:35 +00:00
a52e89b6f7 [inductor]re-enable cpu reduction ut (#122289)
Re-enable these two ut. I can pass these two ut on my local and we can see the status in the CI for this PR.

See the background about why they are disabled https://github.com/pytorch/pytorch/issues/93542, https://github.com/pytorch/pytorch/issues/87157.

After https://github.com/pytorch/pytorch/pull/115620. The reduction orders should be deterministic.
However, the orders may not exactly same with ref path (`aten`). We may can set larger tolerance if they still cannot be passed in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122289
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-31 08:33:14 +00:00
56451cd49d Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type defination gap between Windows and Linux.
2. Fix some operator not support on Windows, such as [], /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixed build issue on Windows.
6. Fixed bazel build issues.
7. Fix test app not link to sleef on Windows.

Note: If rebuild fail after pulled this PR, please sync `sleef` submodule by run:
```cmd
git submodule sync
git submodule update --init --recursive
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-31 03:07:32 +00:00
2b1ba0ceae [DeviceMesh] Cache and reuse sliced result (#122975)
Fixes #118849

Add a map for parent_to_child_mappings in _mesh_resources so we can cache and reuse submesh slicing result so that we can avoid recreating submesh and the underlying sub pg repeatedly, which could lead to funky behaviors.

We will follow up with reusing pg from the parent_mesh during submesh creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
2024-03-30 23:56:55 +00:00
35c493f2cf [CPP Extension] Escape include paths (#122974)
By using `shlex.quote` on Linux/Mac and `_nt_quote_args` on Windows

Test it by adding non-existent path with spaces and single quote

TODO: Fix double quotes on Windows (will require touching `_nt_quote_args`, so will leave it for another day

Fixes https://github.com/pytorch/pytorch/issues/122476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122974
Approved by: https://github.com/Skylion007
2024-03-30 21:58:29 +00:00
557e7c9c16 Add some type hints to functions and update a few spelling mistakes (#123015)
# Summary
While working on this PR: https://github.com/pytorch/pytorch/pull/121845
I found that these type hints made my ide/ noob experience easier to reason about

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123015
Approved by: https://github.com/Skylion007
2024-03-30 21:15:01 +00:00
e203aa9fab [FSDP] [easy] fix HSDP validation error msg (#123019)
Summary:
This would otherwise yield

> ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.')

which is odd.

Remove the extra tailing commas.

Test Plan: CI

Differential Revision: D55549851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019
Approved by: https://github.com/Skylion007
2024-03-30 18:12:34 +00:00
ec58f1f74e [inductor] make mask_rcnn inference work in max-autotune mode (#123008)
inference for vision_maskrcnn model fail when max-autotune is enabled.

Repro:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 --backend inductor --only vision_maskrcnn
```

It turns out that MA code receives empty input tensor for convolution and some places in MA related code does not handle this corner case properly. This PR enhance that and now the accuracy test above can pass.

Regarding why the input tensor is empty, I think it's probably due to no objects are detected in the input images (random data?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123008
Approved by: https://github.com/jansel
2024-03-30 16:39:57 +00:00
5e878be101 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit d94db5f6ee0af745c0d17cc6c87f695baa2b3b5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/atalman due to Breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2028084839))
2024-03-30 14:20:54 +00:00
b8550f527f Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on XPU backend. Add GPU trace to xpu runtime. It is beneficial to generalize the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-30 13:07:53 +00:00
eb7adc3ae0 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-30 13:04:38 +00:00
99f8f77de9 [Inductor] Fix AFOC QPS Regression. (#122944)
Summary: Recently, we observed ~8% qps regression for AFOC model. After dig into the problem, I found it was introduced by D55272024, where the split node normalization was skipped or call_method split node, while our pattern detection based on the assumption that all split node has been normalized to call_funciton node. More context: https://docs.google.com/document/d/19h-fu2BqdUXMaSqbd7c0-Qe00ic7quUN-emJqH_1-SA/edit

Test Plan:
# unit test
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/0792d406-3d64-4b9c-95cc-15fb0cc76a96
Test UI: https://www.internalfb.com/intern/testinfra/testrun/11258999096315690
Network: Up: 113KiB  Down: 535KiB  (reSessionID-6132c09b-2ce7-4e89-b61d-d6c6142630cc)
Jobs completed: 26. Time elapsed: 1:25.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 10. Fail 0. Fatal 0. Skip 0. Build failure 0
```
buck2 test @mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13792273886410433
Network: Up: 1.3MiB  Down: 960KiB  (reSessionID-0bea8575-f163-4c5d-b201-69e05806af98)
Jobs completed: 68. Time elapsed: 2:47.2s.
Cache hits: 0%. Commands: 13 (cached: 0, remote: 1, local: 12)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "afoc" --flow_id 545665840
```
Now the merge_splits_pass is conducted.
```
'inductor': Counter({'pattern_matcher_nodes': 1614, 'pattern_matcher_count': 1566, 'normalization_pass': 645, 'remove_split_with_size_one_pass': 629, 'batch_aten_mul': 13, 'scmerge_split_sections_removed': 11, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'merge_splits_pass': 3, 'merge_getitem_cat_pass': 2, 'scmerge_split_removed': 2, 'batch_linear_post_grad': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1})}
```

# e2e
baseline:
f545633808

before_fix:
f545665840

After_fix:
f546227494

proposal:

Differential Revision: D55513494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122944
Approved by: https://github.com/jackiexu1992
2024-03-30 07:34:41 +00:00
2cd3ef4777 Check scale dtype for fake_quantize_per_channel_affine_cachemask (#120987)
Fixes #120903

Scale for fake quant is assumed FP32 but not checked. If scales of double dtype are passed in, an internal error is raised: `TORCH_INTERNAL_ASSERT(!needs_dynamic_casting<func_t>::check(iter));` in aten/src/ATen/native/cpu/Loops.h
This PR adds a check of scale dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120987
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-03-30 07:32:32 +00:00
07f0ff6ed7 [DCP][FSDP2][Test] Add_adamW to test_train_parity_2d_transformer_checkpoint_resume (#122002)
Want to add the option of AdamW here, as currently this is the only test for 2D.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122002
Approved by: https://github.com/awgu, https://github.com/fegin
2024-03-30 07:28:41 +00:00
ed457c7dbe [export] Add torch_fn (#122693)
This PR adds a new metadata, `torch_fn` which is meant to replace `source_fn_stack` as `source_fn_stack` is not entirely well defined between strict/nonstrict. Previous discussion [here](https://docs.google.com/document/d/1sPmmsmh6rZFWH03QBOe49MaXrQkP8SxoG8AOMb-pFk4/edit#heading=h.anmx9qknhvm).

`torch_fn` represents the torch function that a particular aten operator came from. For example, `torch.nn.Linear` goes down to the `torch.nn.functional.linear` at the `__torch_function__` layer, and then `aten.t/aten.addmm` in the `__torch_dispatch__` layer. So the nodes `aten.t/aten.addmm` will now have the `torch_fn` metadata containing the `torch.nn.functional.linear`.

The `torch_fn` metadata is a tuple of 2 strings: a unique identifier for each torch function call, and the actual torch function `f"{fn.__class__}.{fn.__name__}"`. The purpose of the first value is to distinguish between 2 consecutive calls to the same function. For example, if we had 2 calls to `torch.nn.Linear`, the nodes and corresponding metadata would look something like:
```
aten.t - ("linear_1", "builtin_function_or_method.linear"),
aten.addmm - ("linear_1", "builtin_function_or_method.linear"),
aten.t - ("linear_2", "builtin_function_or_method.linear"),
aten.addmm - ("linear_2", "builtin_function_or_method.linear"),
```

Higher order ops -- currently we can get the torch_fn metadata for nodes within the HOO's subgraph, but after retracing, this becomes the `(cond, higher_order_op.cond)` :( This is because `fx_traceback.set_current_meta` points to the cond node in the toplevel graph, rather than the original node in the subgraph. I think this is because `fx.Interpreter` does not go into the cond subgraphs. (will discuss with Yidi more ab this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122693
Approved by: https://github.com/tugsbayasgalan
2024-03-30 06:47:15 +00:00
3a9eead4ab [inductor] Don't compile MultiKernelCall in a subprocess (#123010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123010
Approved by: https://github.com/shunting314
ghstack dependencies: #123009
2024-03-30 05:46:09 +00:00
6c0911f1d9 [inductor] Skip cudagraphs warning on CPU (#123009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123009
Approved by: https://github.com/shunting314
2024-03-30 05:46:09 +00:00
0b7a156f68 [executorch hash update] update the pinned executorch hash (#122662)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122662
Approved by: https://github.com/pytorchbot
2024-03-30 05:18:53 +00:00
c66a44ea79 [AOTInductor] Support many outputs aliasing the same tensor (#122846)
fixes https://github.com/pytorch/pytorch/issues/122826

# Problem
When the model returns multiple outputs which alias the same tensor, we get a SEGFAULT. Because we try to release the same buffer twice.
```
def forward(x):
  x_out = x + 1
  contig = x_out.contiguous()   # alias of same tensor as x_out
  return x_out, contig

run_impl() {
  output_handles[0] = buf0.release();
  output_handles[1] = buf0.release();   # SEGFAULT
}

# if we try to workaround this by assign aliases without creating a new tensor,
# then, we'll get a double free error during handle clean-up.
output_handles[1] = output_handles[0];    # assign without creating a new tensor
...
alloc_tensors_by_stealing_from_handles(){
  aoti_torch_delete_tensor_object(handles[0]);
  aoti_torch_delete_tensor_object(handles[1]);   # Double free
}
```

# Solution
~~Instead, we use the first `output_handle` that shares the same tensor and alias it.~~
```
output_handles[0] = buf0.release();
aoti_torch_alias_tensor(output_handles[0], &output_handles[1]);  # No SEGFAULT & No double free!
```

A simpler approach is to figure out which handles are duplicate. Then we simply copy all duplicate except the last one. The last one will use `std::move` and free the tensor owned by the model instance.
```
output_handles[0] = buf0.release();
output_handles[1] = output_handles[0];
```

Differential Revision: [D55455344](https://our.internmc.facebook.com/intern/diff/D55455344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122846
Approved by: https://github.com/desertfire, https://github.com/chenyang78, https://github.com/jingsh
2024-03-30 04:41:17 +00:00
aaba3a87b1 tune down batch-size for res2net to avoid OOM (#122977)
The batch-size for this model is 64 previously. Later on we change that to 256 and cause OOM in cudagraphs setting. This PR tune the batch size down to 128.

Share more logs from my local run
```
cuda,res2net101_26w_4s,128,1.603578,110.273572,335.263494,1.042566,11.469964,11.001666,807,2,7,6,0,0
cuda,res2net101_26w_4s,256,1.714980,207.986155,344.013071,1.058278,22.260176,21.034332,807,2,7,6,0,0
```

The log shows that torch.compile uses 11GB for 128 batch size and 21GB for 256 batch size. I guess the benchmark script has extra overhead cause the model OOM for 256 batch size in the dashboard run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122977
Approved by: https://github.com/Chillee
2024-03-30 03:54:53 +00:00
5a06b8ebfd Remove skipIfTorchDynamo from TestComposability in test_eager_transforms.py (#121830)
Fixes: https://github.com/pytorch/pytorch/issues/96559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121830
Approved by: https://github.com/zou3519
ghstack dependencies: #121410, #121665
2024-03-30 01:55:04 +00:00
3d3d4e1cd5 export XPUStream to doc (#121398)
# Motivation
We would like to export XPUStream to public [doc](https://pytorch.org/cppdocs/api/library_root.html). The detailed documentation can help users understand and utilize XPU more effectively.

# Additional Context
A detailed XPUStream API and usage should be documented to public doc, like cuda's [doc](https://github.com/pytorch/pytorch/blob/main/docs/cpp/source/notes/tensor_cuda_stream.rst).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121398
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/albanD
2024-03-30 00:36:26 +00:00
f4ff063c33 Add attributes to xpu device prop (#121898)
# Motivation
Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile`  or directly passed to triton to generate more optimized code based on device properties.

# Additional Context
expose the following attributes to `torch.xpu.get_device_properties`:
- `has_fp16` (newly added)
- `has_fp64` (newly added)
- `has_atomic64` (newly added)
- `driver_version`
- `vendor`
- `version`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman
2024-03-30 00:25:39 +00:00
b5bef9bbfd Fix cpp tests not running + failing to surface (#122845)
The comment in the code should have the information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122845
Approved by: https://github.com/huydhn
2024-03-29 22:41:45 +00:00
4282bb8b07 [c10d] add the source rank which detects the timeout (#122850)
Summary:
When a rank detects a timeout from tcpstore and triggers the dump. It's good to have more info about the source rank which detects the
collective timeout locally. We just need to put the source rank as the
value in the kvstore
Test Plan:
In unit test, we triggered the timeout on rank 0 and rank 1 should get
the timeout signal from store and log the correct source rank:

```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (34d27652)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTestTimeoutDumpOnStuckRanks
NCCL version 2.19.3+cuda12.0
[rank0]:[E327 17:04:16.986381360 ProcessGroupNCCL.cpp:565] [Rank 0]
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2,
OpType=ALLREDUCE, NumelIn=12, NumelOut=12, Timeout(ms)=1000) ran for
1099 milliseconds before timing out.
[rank0]:[E327 17:04:16.988036373 ProcessGroupNCCL.cpp:1582] [PG 0 Rank
0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed
   NCCL work: 1.
   [rank0]:[E327 17:04:16.182548526 ProcessGroupNCCL.cpp:1346] [PG 0
   Rank 0] Received a timeout signal from this local rank and will start
   to dump the debug info. Last enqueued NCCL work: 2, last completed
   NCCL work: 1.
   [rank0]:[E327 17:04:16.247574460 ProcessGroupNCCL.cpp:1167] [PG 0
   Rank 0] ProcessGroupNCCL preparing to dump debug info.
   [rank1]:[E327 17:04:16.273332178 ProcessGroupNCCL.cpp:1346] [PG 0
   Rank 1] Received a global timeout from another rank 0, and will start
   to dump the debug info. Last enqueued NCCL work: 1, last completed
   NCCL work: 1.
   [rank1]:[E327 17:04:16.273565177 ProcessGroupNCCL.cpp:1167] [PG 0
   Rank 1] ProcessGroupNCCL preparing to dump debug info.
   [rank1]:[F327 17:04:16.274256512 ProcessGroupNCCL.cpp:1185] [PG 0
   Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog detected a
   collective timeout from another rank 0 and notified the current rank.
   This is most likely caused by incorrect usages of collectives, e.g.,
   wrong sizes used across ranks, the order of collectives is not same
   for all ranks or the scheduled collective, for some reason, didn't
   run. Additionally, this can be caused by GIL deadlock or other
   reasons such as network errors or bugs in the communications library
   (e.g. NCCL), etc. We tried our best to dump the debug info into the
   storage to help you debug the issue.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122850
Approved by: https://github.com/wconstab
2024-03-29 22:22:37 +00:00
d7d77a152c [ez] Increase slow grad check shards 4 to 6 (#122631)
They take almost 4 hours to run completely for one shard

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122631
Approved by: https://github.com/huydhn
2024-03-29 21:49:27 +00:00
ea33adf6c2 [vec] test VecMask in vec_test_all_types (#122878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122878
Approved by: https://github.com/malfet
ghstack dependencies: #119979, #122869
2024-03-29 21:48:29 +00:00
c9b32c9caa [vec] test at::vec::convert in vec_test_all_types (#122869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122869
Approved by: https://github.com/malfet
ghstack dependencies: #119979
2024-03-29 21:48:29 +00:00
6f4ed57b8a [inductor][cpp] unified the vectorized conversion with at::vec::convert for all data types (#119979)
This PR unified the vectorized conversion with `at::vec::convert` for all vectorized data types. The intrinsics implementations are implemented as a specialization and moved to their own arch-specific files. The vectorized conversion logic in cpp Inductor is simplified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119979
Approved by: https://github.com/jansel, https://github.com/malfet
2024-03-29 21:48:29 +00:00
05e54536fb [CI] Removed tests for torch.utils.tensorboard.summary.hparams (#122556)
Partially addresses #122160

In the module `torch.utils.tensorboard.summary`, the `hparams` method does not depend on any utilities from pytorch as it uses only the utilities from `tensorboard`. Thus, I think it will be safe to delete the test for `hparams` method as it does not depend on pytorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122556
Approved by: https://github.com/huydhn
2024-03-29 21:44:02 +00:00
482d8bf1ea [aoti] Change aot_compile callsites (#122225)
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True)   # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...)  # Takes an exported program and compiles it into a .so
```

This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.

Test Plan: CI

Differential Revision: D54808612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
2024-03-29 21:34:20 +00:00
267145c5d0 Enable full state checking (#122971)
Fixes https://github.com/pytorch/pytorch/issues/115679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122971
Approved by: https://github.com/anijain2305
2024-03-29 21:24:57 +00:00
4d6cb7bca0 Use Q-NEON register to compute the dot product (#122952)
Make transposed gemv a bit faster
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122952
Approved by: https://github.com/kimishpatel
ghstack dependencies: #122951
2024-03-29 21:09:08 +00:00
73e362756b Avoid COW materialize in conv forward ops (#122748)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122748
Approved by: https://github.com/ezyang
ghstack dependencies: #122720
2024-03-29 20:34:19 +00:00
cyy
7423092227 [TorchGen] [2/N] Remove unused variables and simplify dictionary iterations (#122585)
This PR continues to remove unused variables and simplifies dictionary iterations from TorchGen scripts, following #122576.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122585
Approved by: https://github.com/ezyang
2024-03-29 20:34:11 +00:00
57a9a64e10 [BE] Give a different error message when evaluating an integer. (#122938)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122938
Approved by: https://github.com/Skylion007
2024-03-29 19:14:15 +00:00
3178ba0dc9 Don't use sympy Float functions, use an opaque one with no reasoning (#122823)
Sympy simplifications don't obey floating point semantics, so don't
use Sympy for this.  Keep them as is, only evaluate with the reference
implementations when all arguments are known.

This may end up getting subsumed by some other changes later, but I
wanted to understand if this was easy and it seems to be easy.

This doesn't actually depend on the earlier diffs on the stack and I can detach it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823
Approved by: https://github.com/lezcano
2024-03-29 19:13:55 +00:00
ae0cf1f98d [TD][ez] Set pytest cache bucket default to gha-artifacts (#122901)
After https://github.com/pytorch/pytorch/pull/121907/files

Example failure: https://github.com/pytorch/pytorch/actions/runs/8473386479/job/23217733984#step:5:130
```
usage: pytest_cache.py [-h] (--upload | --download) --cache_dir CACHE_DIR
                       --pr_identifier PR_IDENTIFIER --job_identifier
                       JOB_IDENTIFIER [--sha SHA] [--test_config TEST_CONFIG]
                       [--shard SHARD] [--repo REPO] [--temp_dir TEMP_DIR]
                       [--bucket BUCKET]
pytest_cache.py: error: argument --bucket: expected one argument
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122901
Approved by: https://github.com/huydhn
2024-03-29 18:52:58 +00:00
99d939f51f [dynamo] Bugfix for HASATTR guard (#122947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122947
Approved by: https://github.com/jansel
ghstack dependencies: #122828
2024-03-29 18:50:33 +00:00
0a7162f898 Fix svd_lowrank parameter M (#122681)
ISSUE: #122699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122681
Approved by: https://github.com/lezcano
2024-03-29 18:06:38 +00:00
487b6d40ec Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast
](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-29 18:05:28 +00:00
3243be7c3a [FSDP2] Removed wrapSwapTensorsTest since no longer needed (#122962)
We do not need to set the flag after https://github.com/pytorch/pytorch/pull/122755.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122962
Approved by: https://github.com/mikaylagawarecki
2024-03-29 17:53:18 +00:00
a236fa9f06 Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)"
This reverts commit 384de46395234e793a319325e5c9d20a60407a64.

Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))
2024-03-29 17:52:39 +00:00
2a137f7af1 [dynamo] Support hasattr on UserDefinedClassVariable (#122564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122564
Approved by: https://github.com/anijain2305
2024-03-29 17:34:14 +00:00
772e142e70 [dynamo] Delay cuda device registration (#122795)
the module-level `torch.cuda.device_count` calls are delayed until reading the registered devices.

Fixes #122085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122795
Approved by: https://github.com/ezyang
2024-03-29 17:22:18 +00:00
315bd951e4 Add inductor fx pass unit test for shape propagation (#122897)
Summary: Pre-grad fx passes expect information from shape propagation to be present. D55221119 ensured that `pass_execution_and_save` invokes shape propagation, and this diff adds a covering unit test to prevent regression.

Test Plan: New UT passes locally.

Differential Revision: D55440240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122897
Approved by: https://github.com/khabinov, https://github.com/Skylion007
2024-03-29 16:44:22 +00:00
b83c94339e Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857)
This PR fixes the two major issues that was discovered after the initial merge of PR #121561
1. The Flash Attention support added by has severe performance regressions on regular shapes (power of two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and only has numerical stability advantages. This PR fixes this problem.
2. There is a flaw of memory storage handling in PR #121561 which does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class which is unnecessary due to the more flexible backward kernel shipped by PR #121561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-03-29 16:37:24 +00:00
d8b69de73b [EZ] Run fp16 torch.mm/torch.mv across CPU threads (#122951)
This significantly speeds up real world applications, such as LLMs

Before this change llama2-7b fp16 inference run at 1.5 tokens per sec,
after it runs at almost 6 tokens per sec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122951
Approved by: https://github.com/ezyang
2024-03-29 16:14:59 +00:00
cyy
fb90b4d4b2 [TorchGen] Use std::optional in generated code (#121454)
This PR changes TorchGen to generate std::optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121454
Approved by: https://github.com/ezyang
2024-03-29 14:11:09 +00:00
375a8041ed [AOTI][refactor] Improve logging (#122932)
Summary: Improve some logging msgs, and change a data type to remove a compile time warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122932
Approved by: https://github.com/chenyang78
2024-03-29 14:02:23 +00:00
cyy
769d1909f0 Enable clang-tidy warnings of aten/src/ATen/functorch (#122933)
Enable clang-tidy in aten/src/ATen/functorch,  following #122779.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122933
Approved by: https://github.com/ezyang
2024-03-29 14:01:28 +00:00
38946bff51 Added DispatchKey.CompositeImplicitAutograd to all upsample_nearest*.default decompositions (#122782)
Related to https://github.com/pytorch/pytorch/pull/117632#issuecomment-2021321172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122782
Approved by: https://github.com/ezyang
2024-03-29 13:55:25 +00:00
b524a404e0 Fixed support for uint8 in upsample bicubic2d decomposition (#120411)
Superseeds https://github.com/pytorch/pytorch/pull/104248

Description:
- Fixed support for uint8 for upsample bicubic2d decomposition (on `main` results are wrong, so we can tolerate the slowdown)
- Added missing clamp(0, 1) for xscale and yscale
  - slowdown for f32 on cpu. PR on nodes fusion on CPU: https://github.com/pytorch/pytorch/pull/120077 can help for upsampling cases with align corners = true
  - the slowdown mainly due to the added clamp op and also partially reduced when using torch.stack in weights computation on cpu.
- Removed lowering implementation

Benchmarks:
```
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |        613.029 (+-1.590)        |         5477.608 (+-9.027)         |           3060.314 (+-12.368)           |     0.559 (+-0.000)      |          608.735 (+-6.336)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |        610.176 (+-1.428)        |        5718.503 (+-11.203)         |           3424.022 (+-12.836)           |     0.599 (+-0.000)      |          604.781 (+-6.229)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        325.001 (+-0.840)        |        6183.029 (+-10.893)         |            3275.032 (+-7.625)           |     0.530 (+-0.000)      |          325.693 (+-1.067)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        325.855 (+-1.108)        |        6391.394 (+-11.552)         |            3533.410 (+-7.666)           |     0.553 (+-0.000)      |          325.838 (+-1.457)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       2521.533 (+-14.857)       |        5025.217 (+-13.415)         |            2814.304 (+-6.742)           |     0.560 (+-0.000)      |         2520.308 (+-10.796)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       2531.204 (+-12.534)       |        5294.925 (+-11.994)         |            3147.590 (+-6.808)           |     0.594 (+-0.000)      |         2521.228 (+-11.732)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |        758.352 (+-10.362)       |        5639.912 (+-14.495)         |            3014.123 (+-8.799)           |     0.534 (+-0.000)      |          756.114 (+-4.792)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |        758.712 (+-5.781)        |         5927.541 (+-9.982)         |            3249.555 (+-7.226)           |     0.548 (+-0.000)      |          757.719 (+-5.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       1524.469 (+-12.860)       |        34321.641 (+-80.310)        |           19373.714 (+-56.351)          |     0.564 (+-0.000)      |         1518.082 (+-49.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       1521.746 (+-13.780)       |        35949.711 (+-81.010)        |           21782.366 (+-68.938)          |     0.606 (+-0.000)      |         1467.911 (+-15.901)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |        712.311 (+-5.361)        |        38826.510 (+-92.267)        |           20762.314 (+-59.303)          |     0.535 (+-0.000)      |          712.669 (+-4.673)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |        715.060 (+-4.757)        |        40269.353 (+-92.543)        |           22402.114 (+-81.574)          |     0.556 (+-0.000)      |          716.001 (+-8.945)

      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |       2331.889 (+-29.159)       |        21541.096 (+-72.346)        |           12181.194 (+-45.288)          |     0.565 (+-0.000)      |         2304.864 (+-21.351)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |       2333.697 (+-10.066)       |        22514.154 (+-57.798)        |           21709.449 (+-98.307)          |     0.964 (+-0.000)      |         2302.141 (+-13.041)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        1198.768 (+-5.364)       |       37652.371 (+-101.644)        |           42740.413 (+-98.571)          |     1.135 (+-0.000)      |          1197.104 (+-7.225)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        1196.851 (+-5.118)       |       39678.341 (+-173.750)        |           46807.738 (+-92.744)          |     1.180 (+-0.000)      |          1189.322 (+-5.681)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       10020.978 (+-54.855)      |        19955.290 (+-71.891)        |           11420.521 (+-53.179)          |     0.572 (+-0.000)      |         9999.583 (+-61.230)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       10066.441 (+-62.700)      |       21058.334 (+-183.414)        |           19986.577 (+-65.304)          |     0.949 (+-0.000)      |         10018.672 (+-59.188)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |       3171.135 (+-14.635)       |        19687.864 (+-54.320)        |           23313.699 (+-57.391)          |     1.184 (+-0.000)      |         3182.191 (+-17.686)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |       3181.314 (+-13.784)       |        20224.468 (+-50.827)        |          30541.963 (+-381.385)          |     1.510 (+-0.000)      |         3183.578 (+-16.203)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       5879.450 (+-31.551)       |       136918.555 (+-480.320)       |          77723.568 (+-331.766)          |     0.568 (+-0.000)      |         5726.061 (+-87.517)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       5882.869 (+-30.325)       |       143378.094 (+-513.842)       |         137244.074 (+-4827.730)         |     0.957 (+-0.000)      |         5727.679 (+-22.164)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |       2674.937 (+-45.003)       |      244829.360 (+-1930.579)       |         271283.073 (+-2243.245)         |     1.108 (+-0.000)      |         2676.054 (+-24.632)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |       2676.217 (+-16.601)       |      248658.668 (+-2904.952)       |         296514.520 (+-2983.281)         |     1.192 (+-0.000)      |         2682.844 (+-19.886)

      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1768.437 (+-6.294)       |        2934.013 (+-28.870)         |            2520.649 (+-6.797)           |     0.859 (+-0.000)      |          1759.292 (+-5.097)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1748.660 (+-5.550)       |         3271.104 (+-7.557)         |            2891.306 (+-7.632)           |     0.884 (+-0.000)      |          1746.341 (+-5.845)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2813.150 (+-6.656)       |         3258.973 (+-7.543)         |            2766.286 (+-6.473)           |     0.849 (+-0.000)      |          2805.077 (+-7.611)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2812.102 (+-8.211)       |         3568.780 (+-9.018)         |            3125.870 (+-7.324)           |     0.876 (+-0.000)      |          2834.178 (+-9.034)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |        1687.975 (+-9.527)       |         2752.085 (+-9.627)         |            2373.274 (+-7.888)           |     0.862 (+-0.000)      |          1698.782 (+-8.098)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |        1696.606 (+-8.678)       |        3056.317 (+-13.303)         |           2699.160 (+-10.638)           |     0.883 (+-0.000)      |         1684.942 (+-10.519)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |        2613.491 (+-9.769)       |        3176.493 (+-13.366)         |            2730.193 (+-9.573)           |     0.859 (+-0.000)      |          2625.085 (+-9.943)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       2614.946 (+-34.129)       |        3465.398 (+-11.165)         |           3044.396 (+-11.447)           |     0.879 (+-0.000)      |          2627.355 (+-9.608)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |       10784.549 (+-58.181)      |        18292.452 (+-59.344)        |           15909.922 (+-49.864)          |     0.870 (+-0.000)      |         10837.656 (+-51.947)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |       10786.513 (+-52.308)      |        20449.038 (+-56.204)        |           18295.997 (+-54.522)          |     0.895 (+-0.000)      |         10843.751 (+-44.781)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |       17532.699 (+-64.807)      |        20425.699 (+-80.271)        |           17517.040 (+-79.705)          |     0.858 (+-0.000)      |         17595.597 (+-61.870)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |       17530.816 (+-55.131)      |        22450.080 (+-92.899)        |           19827.828 (+-77.649)          |     0.883 (+-0.000)      |         17615.934 (+-71.716)

      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |       6875.484 (+-40.543)       |        11569.509 (+-62.462)        |          10053.350 (+-208.136)          |     0.869 (+-0.000)      |         6864.501 (+-46.747)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |       6843.126 (+-44.498)       |        12915.236 (+-60.654)        |          25335.058 (+-382.640)          |     1.962 (+-0.000)      |         6899.002 (+-46.861)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |       11103.418 (+-51.318)      |        28834.389 (+-78.395)        |          37405.463 (+-581.646)          |     1.297 (+-0.000)      |         11223.012 (+-60.709)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |       11092.994 (+-70.835)      |       36597.023 (+-118.988)        |           45761.267 (+-85.051)          |     1.250 (+-0.000)      |         11104.014 (+-61.288)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |       7106.791 (+-63.666)       |        11191.071 (+-45.402)        |           9786.037 (+-75.781)           |     0.874 (+-0.000)      |         7129.419 (+-77.674)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |       7146.519 (+-28.376)       |        12443.571 (+-39.425)        |           20147.067 (+-74.771)          |     1.619 (+-0.000)      |         7179.622 (+-64.847)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |       10533.849 (+-44.227)      |       34814.909 (+-138.127)        |          42803.001 (+-114.326)          |     1.229 (+-0.000)      |         10644.039 (+-59.681)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       10548.910 (+-44.221)      |       42876.940 (+-146.959)        |          49711.443 (+-139.276)          |     1.159 (+-0.000)      |         10652.375 (+-44.174)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |      42814.521 (+-103.198)      |       73100.489 (+-435.262)        |          63587.659 (+-134.266)          |     0.870 (+-0.000)      |        43208.921 (+-195.287)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |      42812.373 (+-103.870)      |       81769.160 (+-373.369)        |         175159.813 (+-2028.558)         |     2.142 (+-0.000)      |         43007.691 (+-96.358)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |      69955.505 (+-373.373)      |      215248.616 (+-2040.775)       |         267511.246 (+-2094.161)         |     1.243 (+-0.000)      |        70382.679 (+-594.941)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |      69852.157 (+-490.076)      |      242841.484 (+-19645.513)      |         317931.678 (+-2016.498)         |     1.309 (+-0.000)      |        70074.819 (+-352.919)

Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120411
Approved by: https://github.com/lezcano
2024-03-29 13:15:25 +00:00
d94db5f6ee Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type defination gap between Windows and Linux.
2. Fix some operator not support on Windows, such as [], /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixed build issue on Windows.
6. Fixed bazel build issues.
7. Fix test app not link to sleef on Windows.

Note: If rebuild fail after pulled this PR, please sync `sleef` submodule by run:
```cmd
git submodule sync
git submodule update --init --recursive
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-29 07:28:31 +00:00
35c56f85fd [dynamo][pt2d] avoid skipping modules from torch/testing/_internal (#122851)
Dynamo skips user defined modules from `torch/testing/_internal` (eg MLP, Transformer). This PR adds `torch/testing/_internal/...` to `manual_torch_name_rule_map`. It ensures FSDP CI + torch.compile are meaningfully tested

unit test shows frame count = 0 before and frame count > 0 after
```pytest test/dynamo/test_trace_rules.py -k test_module_survive_skip_files```

some FSDP unit tests actually start to compile modules with this change. add trition availability check or disable tests for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122851
Approved by: https://github.com/jansel
2024-03-29 06:42:06 +00:00
10bdf64427 Properly pexpr the actual sympy.Expression, don't repr it. (#122893)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122893
Approved by: https://github.com/albanD, https://github.com/desertfire, https://github.com/jansel
2024-03-29 06:40:19 +00:00
ed37fbdf60 made gpt_fast benchmark run faster (#122872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122872
Approved by: https://github.com/msaroufim, https://github.com/yifuwang
ghstack dependencies: #122848
2024-03-29 03:49:19 +00:00
b9c9f037d1 Added some checkpointing tests (#122848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122848
Approved by: https://github.com/anijain2305
2024-03-29 03:49:19 +00:00
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
6a45809580 Simplify forward AD missing support error (#122639)
This thing about jit decomposition confuses users greatly and I'm not sure what it adds. So removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122639
Approved by: https://github.com/soulitzer
2024-03-29 02:11:46 +00:00
76d8020e62 Add tests for pre_dispatch + run_decomp flow and taskify failures (#122508)
Differential Revision: [D55448616](https://our.internmc.facebook.com/intern/diff/D55448616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122508
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-29 01:47:07 +00:00
cyy
f041df8530 Fix order conditioning of norm kernel (#122874)
NormOneOps is not executed due to an incorrect comparison, this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122874
Approved by: https://github.com/Skylion007
2024-03-29 00:28:13 +00:00
6b8205d3de Revert "Support map in pre-dispatch functionalization (#121444)"
This reverts commit 079feea3379c021a330dbfac7668a5fc8fccc3bd.

Reverted https://github.com/pytorch/pytorch/pull/121444 on behalf of https://github.com/clee2000 due to sorry windows failure seems related 079feea337 https://github.com/pytorch/pytorch/actions/runs/8474191301/job/23220791555. PR got force merged before windows job finished ([comment](https://github.com/pytorch/pytorch/pull/121444#issuecomment-2026323614))
2024-03-28 23:42:26 +00:00
16771747c2 Add tensor step and capturable support to rprop (#122261)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Rprop step update while compiling

Also adds capturable support + testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261
Approved by: https://github.com/janeyx99
2024-03-28 23:31:18 +00:00
e63e013c3b Skip use_count() debug assert for _nested_get_offsets() (#122917)
This broke [internal tests](https://www.internalfb.com/intern/test/844425064039866/) that run with unset `NDEBUG`. It wasn't initially caught because we don't test with unset `NDEBUG` in OSS CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122917
Approved by: https://github.com/soulitzer
ghstack dependencies: #122902
2024-03-28 23:19:17 +00:00
6fc5ad931c Use zeros for NJT dummy to avoid messing with randomness (#122902)
Use of randomness was breaking vmap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122902
Approved by: https://github.com/vmoens, https://github.com/zou3519
2024-03-28 22:09:31 +00:00
f476d707fd Remove previous grad impl. in torch dynamo (#122215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122215
Approved by: https://github.com/zou3519
2024-03-28 22:00:23 +00:00
079feea337 Support map in pre-dispatch functionalization (#121444)
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
2024-03-28 21:56:36 +00:00
481c9bb1fc Upgrade submodule oneDNN to v3.3.6 (#122164)
As the title. Including issue fixes for aarch64:
- https://github.com/oneapi-src/oneDNN/pull/1831
- https://github.com/oneapi-src/oneDNN/pull/1834

---

## Validation results
(on Intel CPU + Linux)
**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**
Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64)
https://github.com/pytorch/pytorch/pull/122164#issuecomment-2007912919

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122164
Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
2024-03-28 21:36:27 +00:00
3924d2189c [FSDP2] Simplified _move_states_to_device (#122907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122907
Approved by: https://github.com/Skylion007
2024-03-28 21:22:59 +00:00
3beb9d85a6 Revert "Add non strict inline constraints and runtime assertions to non-strict exported program (#122722)"
This reverts commit b693fff5d72b249d39436ced577a88d3b866bbba.

Reverted https://github.com/pytorch/pytorch/pull/122722 on behalf of https://github.com/BoyuanFeng due to This breaks torchrec.distributed.tests.test_pt2.TestPt2: test_kjt__getitem__ ([comment](https://github.com/pytorch/pytorch/pull/122722#issuecomment-2026078351))
2024-03-28 20:42:35 +00:00
8852b09abc [FSDP2] Used _chunk_cat for reduce-scatter copy-in (#122888)
This PR uses `_chunk_cat` to fuse padding gradients on dim-0, chunking into `world_size` chunks, and copying them into the reduce-scatter input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122888
Approved by: https://github.com/yifuwang, https://github.com/BoyuanFeng, https://github.com/weifengpy
ghstack dependencies: #122726, #122847
2024-03-28 20:35:45 +00:00
8df99732a4 Revert "Workaround dind-rootless volumes mount as root (#122787)"
This reverts commit 84dc76156a0b8a73e56d80c3947ed9dd03c5ac5e.

Reverted https://github.com/pytorch/pytorch/pull/122787 on behalf of https://github.com/zxiiro due to This broke rocm tests ([comment](https://github.com/pytorch/pytorch/pull/122787#issuecomment-2026022659))
2024-03-28 20:10:19 +00:00
dacc73669c [export] Make quantizer compatible with the standard nn_module_stack. (#122819)
Summary: When we migrate to torch.export, we won't put L['self'] as the prefix for all the fqn in nn_module_stack. This diff adds the branch to handle the new case.

Test Plan: buck test mode/opt caffe2/test/quantization:test_quantization -- -r set_module_name

Differential Revision: D55436617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122819
Approved by: https://github.com/tugsbayasgalan
2024-03-28 19:36:46 +00:00
384de46395 [aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
2024-03-28 19:06:29 +00:00
646dd1ab8d Rewrite quantized conv transpose2d for vulkan (#122547)
Summary: Vulkan rewrite sp that quantized transpose 2d ops can run in a model

Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

# Run CUNET quantized model on hibiki board.

Reviewed By: manuelcandales

Differential Revision: D52344263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122547
Approved by: https://github.com/manuelcandales, https://github.com/copyrightly, https://github.com/yipjustin
2024-03-28 18:51:44 +00:00
71b5b7e081 Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665
Approved by: https://github.com/zou3519
ghstack dependencies: #121410
2024-03-28 18:50:43 +00:00
966ae943df Add wrapper for fbgemm quantization operations (#122763)
Summary:
We add wrappers for fbgemm's packing so we can pass it through PT2 to
lowering phase of AOTInductor.

Test Plan:
Included in commit.
test_quantized_ops::test_wrapped_fbgemm_linear_fp16

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D55433204](https://our.internmc.facebook.com/intern/diff/D55433204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122763
Approved by: https://github.com/jerryzh168
ghstack dependencies: #122762
2024-03-28 18:41:18 +00:00
e296722e0e Z3 validation: Lift operators later when we actually run with Z3 (#122791)
Previously, we lifted operators putting them into the FX graph, limiting
the applicability of the FX graph for only Z3.  Now, we lift operators
when we are interpreting, which means I can use the graph for other
things.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122791
Approved by: https://github.com/Chillee, https://github.com/lezcano
2024-03-28 18:31:30 +00:00
3d2d7ba19d Delete torch.autograd.function.traceable APIs (#122817)
We deprecated them in 2.3 with plans to delete in 2.4. Very few OSS
repos use this flag at all and it also does nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122817
Approved by: https://github.com/albanD
2024-03-28 18:24:15 +00:00
a3b30851c5 Add quantized.linear_unpacked_dynamic_fp16 (#122762)
Summary:

We add a new op quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with different (unpacked) weight/bias format.
This op does packing on the fly for each call with standard at::Tensor weight & bias.

Test Plan:
Included in commit.
test_quantized_op::test_unpacked_qlinear_dynamic_fp16

Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762
Approved by: https://github.com/jerryzh168
2024-03-28 18:02:27 +00:00
59f6393209 [docs] Update PT2+Profiler docs (#122272)
Document:
* Torch-Compiled Region
* What to expect in kernels inside a torch-compiled region

For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272
Approved by: https://github.com/aaronenyeshi
2024-03-28 17:52:28 +00:00
091a24495b [AOTInductor] Support use_runtime_constant_folding for CPU. (#122563)
Summary:
We allow CPU to use the config use_runtime_constant_folding.
Changes include
1. Rearrange USE_CUDA flags. Add CPU sections that consumes memory directly.
2. Codegen changes to accomodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration.

Test Plan: Activate tests that were deactivated for CPU before.

Reviewed By: khabinov

Differential Revision: D55234300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563
Approved by: https://github.com/chenyang78
2024-03-28 17:49:05 +00:00
8a33a77fd1 Back out "Added a check in register_lowering to avoid decomposed ops (#117632)" (#122709)
Summary:
Original commit changeset: ebda663a196b

Original Phabricator Diff: D55271788

Test Plan: Some models are failing torch compile with this, retrying the tests

Reviewed By: colinchan15

Differential Revision: D55374457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122709
Approved by: https://github.com/huydhn
2024-03-28 17:46:57 +00:00
4670dcc94c [Inductor]Fix a couple of broken unit tests (#122714)
Summary: Titled

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Buck UI: https://www.internalfb.com/buck2/ad05a43c-cb4a-443e-8904-b4d53e4f4b1e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798909218388
Network: Up: 107KiB  Down: 28KiB  (reSessionID-d7146e4f-773a-46ea-9852-f10f59302479)
Jobs completed: 24. Time elapsed: 1:49.3s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor/fb:split_cat_fx_passes_fb
```

Buck UI: https://www.internalfb.com/buck2/82dbf3b0-c747-4c07-98b8-53b69afa3157
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900267699118
Network: Up: 1.4GiB  Down: 2.3GiB  (reSessionID-0bd22c6d-5dfe-4b4a-bc24-705eadac884b)
Jobs completed: 252570. Time elapsed: 7:25.2s.
Cache hits: 95%. Commands: 123778 (cached: 117999, remote: 2779, local: 3000)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D55378009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122714
Approved by: https://github.com/SherlockNoMad
2024-03-28 17:44:30 +00:00
07f94df1a6 [torch quantization]fix HistogramObserver OOM when (self.max_val - self.min_val) is too small (#122659)
Differential Revision: D55347133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122659
Approved by: https://github.com/jerryzh168
2024-03-28 17:41:21 +00:00
d65b9dff73 [AMD] turn off triton memcache for amd devices (#122560)
Summary:
triton memcache is not supported on amd devices yet and causes torch.compile to fail

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
ci

Sandcastle run

Differential Revision: D55285655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122560
Approved by: https://github.com/jansel
2024-03-28 17:38:21 +00:00
d9a08de9a4 Add Opinfo entries for HOP testing (#122265)
In this PR, we add a systematic way to test all HOPs to be exportable as export team has been running into various bugs related to newly added HOPs due to lack of tests. We do this by creating:
- hop_db -> a list of HOP OpInfo tests which then used inside various flows including export functionalities: [aot-export, pre-dispatch export, retrace, and ser/der

For now, we also create an allowlist so that people can bypass the failures for now. But we should discourage ppl to do that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122265
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-28 17:36:43 +00:00
0bfa9f4758 [ROCm][ATen][Native] Fix kernel cache selecting kernels for incorrect architectures (#121401)
Fixes #120794

Torch creates a cache of compiled kernels at $HOME/.cache/torch/kernels. The names used to save and select the cached kernels use cuda_major and cuda_minor to identify the gpu architecture for which the gpu kernels where compiled. On ROCM this is insufficient as on rocm cudaDeviceProp  cuda_major and cuda_minor are mapped to hipDeviceProp_t::major and hipDeviceProp_t::minor which correspond to the first and second number of the LLVM target corresponding to the architecture in question:

GFX1030 is major = 10, minor = 3
GFX1032 is major = 10, minor = 3
GFX900 is major = 9,  minor = 0
GFX906 is major = 9,  minor = 0
GFX908 is major = 9,  minor = 0

Thus it can be seen hipDeviceProp_t::major and hipDeviceProp_t::minor are insufficient to uniquely identify the ROCM architecture. This causes the rocm runtime to raise an error when an operation uses a cached kernel that was first cached on a architecture with the same hipDeviceProp_t::major and hipDeviceProp_t::minor but a different llvm target.

The solution provided in this pr is to replace the use of hipDeviceProp_t::major,hipDeviceProp_t::minor with hipDeviceProp_t::gcnArchName when pytorch is compiled for rocm which contains a string identical to the LLVM target of the architecture in question

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121401
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/malfet
2024-03-28 17:24:31 +00:00
9693797491 [PT2][Inductor][Observability] Improve the optimus scuba log (#122361)
Summary: Titled

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/18014398535709463
Network: Up: 113KiB           Down: 480KiB           (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389)
Discovered 9. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.3s
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.4s
Command: test.                                                                                 Remaining: 9/24. Cache hits: 0%. Time elapsed: 44.5s
Network: Up: 117KiB  Down: 507KiB  (reSessionID-1d2e3558-15b5-4a4e-8c5d-10c983afb389)
Jobs completed: 24. Time elapsed: 1:48.3s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0
```
buck2 test mode/dev-nosan //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16044073698893554
Network: Up: 120KiB  Down: 60KiB  (reSessionID-57f2c21b-3f4e-462b-9e5b-fe3dd15f6b7d)
Jobs completed: 28. Time elapsed: 1:47.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

optimus_scuba_log:
```
{'before_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIbj2haUwKx69H8BAKXdGqXZSpoybr0LAAAz', 'group_batch_fusion_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFqhiRYcJ_C4JFoDABKPTsfpzjJ_br0LAAAz', 'normalization_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIvswhaiAVyipcoGAJZ5sUi8Bb5qbr0LAAAz', 'remove_split_with_size_one_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GFneTxcVBPaqVuwCADCiI4q1mEwlbr0LAAAz', 'merge_getitem_cat_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GJc0Phn87ljuMO0CADBPGqqehKp2br0LAAAz', 'merge_splits_pass_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GLWB_BbvLyT7D_0DABmygDYPDjJ_br0LAAAz', 'after_recompile_pre_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GO6eQBeIj6oV3o4JAFLzQ3ECMTIrbr0LAAAz', 'inductor_pre_grad': Counter({'pattern_matcher_nodes': 2006, 'pattern_matcher_count': 1806, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1}), 'before_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GMoKmxYg6AUeQ40KAMDaJ4EVDwYmbr0LAAAz', 'group_batch_fusion_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GHIvQxkrV1PMBggEACv7786a2bE8br0LAAAz', 'after_recompile_post_grad': 'https://www.internalfb.com/intern/everpaste/?color=0&handle=GIpBNxXupQTHWx8BALSiVrKgDbtfbr0LAAAz', 'inductor_post_grad': Counter({'pattern_matcher_nodes': 2093, 'pattern_matcher_count': 1893, 'normalization_pass': 861, 'remove_split_with_size_one_pass': 748, 'merge_splits_pass': 82, 'merge_getitem_cat_pass': 11, 'scmerge_split_sections_removed': 4, 'batch_layernorm': 1, 'batch_sigmoid': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'scmerge_split_removed': 1, 'scmerge_cat_removed': 1, 'batch_aten_mul': 1})}
```

Differential Revision: D55107000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122361
Approved by: https://github.com/jackiexu1992
2024-03-28 17:13:32 +00:00
049d68d8bb [inductor][Autotune] Add matrix_instr_nonkdim to triton_meta (#122852)
Summary: Previous work `https://github.com/pytorch/pytorch/pull/120742` to enable `matrix_instr_nonkdim` only dealt with the autotuner benchmarking, but failed to enable the parameter in Triton meta for real runs. `matrix_instr_nonkdim` needs to be visible to the compiler driver to set up the optimization pipeline, so it's unlike other kernel parameters such as `BLOCK_N` that can be just set inside the kernel itself.

Test Plan:
P1201466917

  triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
  )

Perf :
Before: 1.693ms    0.134GB    79.28GB/s
After:    1.577ms    0.134GB    85.12GB/s

Differential Revision: D55456401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122852
Approved by: https://github.com/xw285cornell
2024-03-28 16:58:38 +00:00
1e8d4b389b Super tiny fix typo (#122881)
"CustoType" -> "CustomType"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122881
Approved by: https://github.com/awgu
2024-03-28 16:13:25 +00:00
958dbb876c Revert "_foreach_copy with different src/dst dtypes (#121717)"
This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f.

Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295))
2024-03-28 15:54:40 +00:00
8698121636 Revert "Add RMSNorm module (#121364)"
This reverts commit a7306de0dc96cda8b698d19680a88d27aa45a31d.

Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))
2024-03-28 15:31:10 +00:00
8007d9a34a Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621)"
This reverts commit f2c1060de3cdddbfefcab11e547211993d0f9cfa.

Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/atalman due to Broke internal executorch test ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-2025496296))
2024-03-28 15:28:02 +00:00
9208df45cb Fixed increasing CPU overhead of RemovableHandle.__init__ (#122847)
For some reason, if we construct `class Handle(RemovableHandle` inside `register_multi_grad_hook`, then over time, the call to `RemovableHandle.__init__` slows down more and more (when we have GC disabled). Perhaps, this is related to the class attribute `next_id: int = 0`. Python experts: please let me know if you have thoughts 😅

I am open to any suggestions on if how we should deal with this `Handle` class. For now, I changed it to a private `_MultiHandle`.

<details>
<summary> Experiment Script </summary>

```
import gc
import time

import torch

NUM_TENSORS = int(5e4)
ts = [torch.empty(1, requires_grad=True) for _ in range(NUM_TENSORS)]

def hook(grad) -> None:
    return

gc.disable()
times = []
for i, t in enumerate(ts):
    start_time = time.time()

    torch.autograd.graph.register_multi_grad_hook([t], hook)

    end_time = time.time()
    times.append(end_time - start_time)

print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times

times = []
for i, t in enumerate(ts):
    start_time = time.time()

    t.register_hook(hook)

    end_time = time.time()
    times.append(end_time - start_time)

print([f"{t * 1e6:.3f} us" for t in times[1:6]])  # print first few times
print([f"{t * 1e6:.3f} us" for t in times[-5:]])  # print last few times
```
</details>

<details>
<summary> Results </summary>

Before fix:
```
['23.603 us', '19.550 us', '15.497 us', '12.875 us', '13.828 us']
['327.110 us', '341.177 us', '329.733 us', '332.832 us', '341.177 us']
['318.050 us', '315.189 us', '319.719 us', '311.613 us', '308.990 us']
['374.317 us', '394.821 us', '350.714 us', '337.362 us', '331.402 us']
```
Calling `register_multi_grad_hook` makes calling itself and `register_hook` slower (actually, any call to `RemovableHandle.__init__`).

After fix:
```
['13.590 us', '9.060 us', '12.875 us', '7.153 us', '8.583 us']
['4.530 us', '5.245 us', '6.437 us', '4.768 us', '5.007 us']
['2.623 us', '1.907 us', '1.431 us', '1.669 us', '1.192 us']
['1.431 us', '1.431 us', '1.192 us', '1.192 us', '1.431 us']
```
</details>

Update: from @soulitzer

> Your suspicion about next_id is right. I think what is happening is that whenever a class attribute is set, it needs to invalidate some cached data for the subclasses one-by-one. eefff682f0/Objects/typeobject.c (L845)
And this PR fixes the issue by avoiding creating many subclasses dynamically. Changing next_id to something like List[int] or incrementing a global instead also fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122847
Approved by: https://github.com/soulitzer
ghstack dependencies: #122726
2024-03-28 15:24:12 +00:00
4290a57e9c Revert "[NJT] .to() properly updates device of offsets (#122797)"
This reverts commit 3e7fd45b409966440c54f5e370885b4b2a388a01.

Reverted https://github.com/pytorch/pytorch/pull/122797 on behalf of https://github.com/jeffdaily due to Sorry for reverting your change but it is failing CUDA and ROCm jobs in trunk. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/122797#issuecomment-2025473181))
2024-03-28 15:17:45 +00:00
cyy
d6aed1b692 Fix clang-tidy warnings of aten/src/ATen/functorch (#122779)
This PR fixes some performance related clang-tidy warnings of aten/src/ATen/functorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122779
Approved by: https://github.com/ezyang
2024-03-28 15:15:06 +00:00
6e1c81c687 Revert "Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)"
This reverts commit f9eab9ca92c603e671e7714669758a81ce8d7111.

Reverted https://github.com/pytorch/pytorch/pull/121665 on behalf of https://github.com/guilhermeleobas due to revert PR ([comment](https://github.com/pytorch/pytorch/pull/121665#issuecomment-2025460500))
2024-03-28 15:11:51 +00:00
f9eab9ca92 Let dynamo trace some functions in functorch.deprecated.* namespace (#121665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121665
Approved by: https://github.com/zou3519
ghstack dependencies: #121410
2024-03-28 15:07:18 +00:00
f178d996a8 [dynamo] Fix traceback generation on runtime errors (#122746)
Fixes `During handling of the above exception, another exception occurred: [...] torch._dynamo.exc.Unsupported: generator`. traceback.format_exc uses generators which isn't supported by dynamo yet.
<details>
  <summary>current error message</summary>

```
======================================================================
ERROR: test_custom_fn_saved_tensors (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.0", line 4, in forward
    def forward(self, inputs, sizes, hooks):
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 499, in test_custom_fn_saved_tensors
    self.check_output_and_recompiles(fn, 1)
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 61, in check_output_and_recompiles
    actual = list(opt_fn())
  File "/home/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py", line 495, in fn
    loss.backward()
  File "/home/xmfan/core/pytorch/torch/_tensor.py", line 534, in backward
    torch.autograd.backward(
  File "/home/xmfan/core/pytorch/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/xmfan/core/pytorch/torch/autograd/graph.py", line 766, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/_dynamo/eval_frame.py", line 397, in _fn
    res = fn(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 741, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 315, in __call__
    _WrappedCall._generate_error_message(topmost_framesummary),
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 289, in _generate_error_message
    tb_repr = get_traceback()
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 288, in get_traceback
    return traceback.format_exc()
  File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 183, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/home/xmfan/.conda/envs/benchmarks/lib/python3.10/traceback.py", line 136, in format_exception
    return list(te.format(chain=chain))
  File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 941, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/xmfan/core/pytorch/torch/_dynamo/convert_frame.py", line 348, in _convert_frame_assert
    unimplemented("generator")
  File "/home/xmfan/core/pytorch/torch/_dynamo/exc.py", line 199, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: generator
```

</details>

With this change, we get back the descriptive error message:
<details>
  <summary>post-fix error message</summary>

```
Traceback (most recent call last):
  File "/home/xmfan/core/pytorch/torch/fx/graph_module.py", line 307, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xmfan/core/pytorch/torch/nn/modules/module.py", line 1537, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.0", line 4, in forward
    def forward(self, inputs, sizes, hooks):
IndexError: list index out of range

Call using an FX-traced Module, line 4 of the traced Module's generated forward function:

def forward(self, inputs, sizes, hooks):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    getitem = inputs[0]

    getitem_1 = inputs[1];  inputs = None
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122746
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #122691
2024-03-28 14:40:54 +00:00
1d96791661 [dynamo] Fix list proxy to list element proxy source propagation (#122691)
Currently, when we create proxies for a list's elements in wrap_fx_proxy_cls, we create them using the same source as the list's e.g. `LocalSource(inputs)` instead of `GetItemSource(LocalSource(inputs), index=i)`. This results in invalid guards when the tensors it contains becomes dynamic, and the guard system thinks the list is a tensor:
```
Malformed guard:
L['sizes'][0] == L['inputs'].size()[0]
Malformed guard:
2 <= L['inputs'].size()[0]

Traceback [...]
AttributeError: 'list' object has no attribute 'size'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122691
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-03-28 14:40:54 +00:00
0284bca99b Don't cache device_count if we haven't initialized CUDA yet (#122815)
Before initializing CUDA, it can change by modifying CUDA_VISIBLE_DEVICES

Fixes https://github.com/pytorch/pytorch/issues/122085
Fixes https://github.com/pytorch/pytorch/issues/38616
Fixes https://github.com/pytorch/pytorch/issues/110000
Fixes https://github.com/pytorch/pytorch/issues/110971
Fixes https://github.com/pytorch/pytorch/issues/95073

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122815
Approved by: https://github.com/albanD
2024-03-28 13:23:45 +00:00
84dc76156a Workaround dind-rootless volumes mount as root (#122787)
In ARC Runners we are using dind-rootless to run docker-in-docker and
in rootless mode volume mounts always mount as root but are mapped to
the local `runner` user in ARC. This causes the build.sh and test.sh
scripts to fail because they run as the `jenkins` user and expect to
be able to write to the workspace path that's being mounted.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2024-03-28 09:06:40 -04:00
cyy
d1da9cc654 [ClangTidy] Disable misc-include-cleaner (#122855)
misc-include-cleaner was introduced in clang-tidy-17 as a way to check missing and unused includes. However, there are lots of transitive headers in PyTorch and it would take enormous efforts to add related annotations to them in order to direct this checker. For this reason, it's better to disable it now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122855
Approved by: https://github.com/cpuhrsch
2024-03-28 10:10:43 +00:00
8c8e4e31f2 Some improvements to nonzero post guard_size_oblivious (#122156)
Prompted by https://github.com/pytorch/pytorch/pull/121571

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122156
Approved by: https://github.com/jansel
2024-03-28 03:53:16 +00:00
caa57e4fcd Add tensor step and capturable support to rmsprop (#122264)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes RMSprop step update while compiling

Adds capturable support to RMSprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264
Approved by: https://github.com/janeyx99
2024-03-28 03:39:28 +00:00
927bc4b558 [vision hash update] update the pinned vision hash (#122754)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122754
Approved by: https://github.com/pytorchbot
2024-03-28 03:27:07 +00:00
c10352a406 [audio hash update] update the pinned audio hash (#122584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122584
Approved by: https://github.com/pytorchbot
2024-03-28 03:26:21 +00:00
235f24fc66 [inductor] Add FileLock around V.debug.copy (#122665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122665
Approved by: https://github.com/ezyang
2024-03-28 03:17:33 +00:00
1b5ccdb0f0 Avoid COW materialize in more forward ops (#122720)
Affected ops:
* ormqr
* lerp
* multinomial
* bernoulli
* histogram
* searchsorted
* log_softmax
* jiterator ops
* dropout
* _segment_reduce

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122720
Approved by: https://github.com/ezyang
2024-03-28 03:02:13 +00:00
60f3c092d4 [dynamo] Config option to Inline builtin nn module forward (#122725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122725
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716, #122769, #122818
2024-03-28 03:01:27 +00:00
d4317becce [dynamo][easy] Force recompilation in a test (#122818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122818
Approved by: https://github.com/williamwen42
ghstack dependencies: #122646, #122647, #122716, #122769
2024-03-28 03:01:27 +00:00
52b1d2a73d Increase timm batch sizes to make less overhead-bound and less noisy (#122581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122581
Approved by: https://github.com/ezyang
ghstack dependencies: #122686, #122688, #121692, #122841
2024-03-28 02:34:32 +00:00
e6ee8322d7 nn.Module: use swap_tensors for Tensor subclasses (#122755)
This fixes a bug when casting a module that has DTensor parameters. The old behavior will swap the .data field of the Tensor subclass which is incorrect behavior when dealing with tensor subclasses that may have multiple child tensors.

This uses the `swap_tensors` method to swap all of the tensors not just the .data field.

Test plan:

```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki
2024-03-28 02:03:09 +00:00
3e7fd45b40 [NJT] .to() properly updates device of offsets (#122797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122797
Approved by: https://github.com/jbschlosser
2024-03-28 00:56:23 +00:00
574a8ccf10 Remove several expectedFailureNonStrict (#122802)
This PR removes several `expectedFailureNonStrict` from `test_export.py`, where the error messages from strict and non-strict export differ a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122802
Approved by: https://github.com/ydwu4
2024-03-28 00:42:49 +00:00
12116aee68 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton support them.
- [x] Only supports power of two sequence lengths.
    * Now it support arbitrary sequence length
- [ ] No support for varlen APIs.
    * varlen API will be supported in future release of AOTriton
- [x] Only support head dimension 16,32,64,128.
    * Now it support arbitrary head dimension <= 256
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/huydhn
2024-03-28 00:27:38 +00:00
8d676a6e8e [dynamo][cpp-guards] Bugfix for size/strides for tensor match (#122828)
This got missed because CPP guard manager is not ON by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122828
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-03-28 00:16:49 +00:00
66510c641f [c10d][NCCL] Refactor coalesced storage (#122651)
The `coalescedDevice_` are `coalescedComms_` used inefficiently and in case of consequent coalescing comms can cause to read-before-write condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122651
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-03-27 23:56:02 +00:00
cc12668053 Fix swap_tensors path in _apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122800
Approved by: https://github.com/albanD
2024-03-27 23:34:16 +00:00
0348773655 Forward fix for subtly breaking AC with compile in the case of stacked (#122841)
checkpoint layers separated by recomputable op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122841
Approved by: https://github.com/anijain2305
ghstack dependencies: #122686, #122688, #121692
2024-03-27 23:23:04 +00:00
a8b7480f0d fix dynamo.explain examples (#122745)
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.

- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.

Fixes #122573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
2024-03-27 22:53:27 +00:00
a54ea7bbd8 Made several changes to min-cut partitioner that allow it to recompute more things (#121692)
Perf results
<img width="862" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/8d44e633-8941-46a6-8e7d-806330a8c890">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121692
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #122686, #122688
2024-03-27 22:45:52 +00:00
bef01c7c2b Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit fe41ba47652ca73569453bddb43605c77bb85184.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))
2024-03-27 22:42:07 +00:00
222dfc4282 [Inductor] Run pattern matcher over the original graph (#122519)
Differential Revision: [D55429070](https://our.internmc.facebook.com/intern/diff/D55429070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519
Approved by: https://github.com/jansel
2024-03-27 22:09:36 +00:00
530e13cf3d Revert "[c10d] disable compute_duration by default (#122138)" (#122539)
This reverts commit bf18e967b4abc90c27ad460680497d8f5ec55962.

It is stacked after a fix to elapsed_time that will resolve the memory issues that required in the introduction of this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122539
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #122538
2024-03-27 21:53:28 +00:00
933d3a7829 Allow dynamo to inline through "hessian" (#121410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121410
Approved by: https://github.com/zou3519
2024-03-27 21:39:37 +00:00
a7306de0dc Add RMSNorm module (#121364)
Similar to dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)

**The implementation here is not optimized and we welcome pull requests to improve this**

- Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation
- Remove the [upcast to float and downcast
](dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364
Approved by: https://github.com/albanD
2024-03-27 21:39:30 +00:00
b693fff5d7 Add non strict inline constraints and runtime assertions to non-strict exported program (#122722)
This PR reduces the difference between strict and non-strict exported program by

- Support `inline_constraints` for non-strict exported program
- Add runtime assertions for range constraints to non-strict exported program

After this PR, the following unit tests are no longer `expectedFailureNonStrict`:
- test_automatic_constrain_size
- test_export_with_inline_constraints
- test_redundant_asserts
- test_constrain_size_with_constrain_value
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122722
Approved by: https://github.com/pianpwk
2024-03-27 21:20:03 +00:00
abe4a0e9eb [dynamo] pop result of print reordering (#122744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122744
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742, #122743
2024-03-27 20:39:39 +00:00
76fe0faadd [dynamo, 3.12] add END_SEND (#122743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122743
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741, #122742
2024-03-27 20:39:39 +00:00
c5d372dafc [dynamo, 3.12] trace through __mro__ attribute access (#122742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122742
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740, #122741
2024-03-27 20:39:39 +00:00
71d40ff861 [dynamo, 3.12] fix typing variable tracing (#122741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122741
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739, #122740
2024-03-27 20:39:39 +00:00
5d0a792d5f [dynamo, 3.12] fix some tests (#122740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122740
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738, #122739
2024-03-27 20:39:39 +00:00
a9704848d1 [dynamo, 3.12] add CALL_INTRINSIC_1 (#122739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122739
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737, #122738
2024-03-27 20:39:39 +00:00
8e5a4248a3 [dynamo, 3.12] add LOAD_SUPER_ATTR (#122738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122738
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530, #122737
2024-03-27 20:39:39 +00:00
8cd7bb7422 [dynamo, 3.12] add LOAD_FAST variants (#122737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122737
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456, #122530
2024-03-27 20:39:39 +00:00
a9b27bbbe9 [dynamo, 3.12] update jump instructions (#122530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122530
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455, #122456
2024-03-27 20:39:39 +00:00
f44f16ebd5 [dynamo, 3.12] add END_FOR (#122456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122456
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449, #122455
2024-03-27 20:39:39 +00:00
bcdd0c6f59 [dynamo, 3.12] add BINARY/STORE_SLICE (#122455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122455
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356, #122449
2024-03-27 20:39:39 +00:00
7b13228038 [dynamo, 3.12] fix DICT_VERSION C++ guards (#122449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122449
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355, #122356
2024-03-27 20:39:39 +00:00
01547960bc [dynamo, 3.12] remove LOAD_METHOD, update LOAD_ATTR (#122356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122356
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354, #122355
2024-03-27 20:39:39 +00:00
8ba26f4aa5 [dynamo, 3.12] support RETURN_CONST (#122355)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122355
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335, #122354
2024-03-27 20:39:39 +00:00
3a67c86f72 [dynamo, 3.12] remove references to PRECALL instruction in 3.12 (#122354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122354
Approved by: https://github.com/jansel
ghstack dependencies: #122146, #122335
2024-03-27 20:39:39 +00:00
35382f0573 [dynamo, 3.12] Use CPython internal _PyOpcode_Caches instead of hardcoding (#122335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122335
Approved by: https://github.com/jansel
ghstack dependencies: #122146
2024-03-27 20:39:39 +00:00
2564f6cf0e [dynamo, 3.12] Allocate Dynamo shadow frames by mimicking CPython (#122146)
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
2024-03-27 20:39:39 +00:00
ccfc87b199 include scheduler_on_plateau in optim.h (#121722)
Fixes #121593
Co-authored-by: Jane Xu <janeyx@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121722
Approved by: https://github.com/albanD
2024-03-27 19:45:25 +00:00
ceff2205e9 [dynamo][cpp-guards] Bugfix to pass on correct example_value (#122769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122769
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647, #122716
2024-03-27 19:40:46 +00:00
7281c5afdc [dynamo][fbcode][torchrec] Selectively inline torchrec/distributed/types.py (#122716)
Manually verified for the internal model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122716
Approved by: https://github.com/jansel
ghstack dependencies: #122646, #122647
2024-03-27 19:40:46 +00:00
5b42c41b19 [dynamo][improve-guard-overhead] Skip TENSOR_MATCH guards on parameters for optimizers (#122647)
**1.32x  guard overhead reduction** (1.092 vs vs 0.827 ms) for MegatronBertForCausalLM with 394 params.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122647
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #122646
2024-03-27 19:40:43 +00:00
c108696228 [dynamo][guards-cpp-refactor][easy] Env variable to turn on cpp manager (#122646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122646
Approved by: https://github.com/jansel
2024-03-27 19:40:37 +00:00
1b9c7e41bb Remove .data call in LSTM as it is not necessary (#122733)
Summary: Title

Test Plan: CI

Differential Revision: D55392057

Functional pre-dispatch tracing chokes on LSTM .data call today. While we need to fix it, it seems this call seems unnecessary here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122733
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2024-03-27 19:08:22 +00:00
1d6fc0d4de Fixed _infer_device_type warning in checkpoint (#122726)
Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726
Approved by: https://github.com/soulitzer
2024-03-27 18:38:42 +00:00
37e3c8f33f [DCP] Supporting resolve_bytes in LoadPlanner (#122700)
1. Supporting resolve bytes, similar to resolve_tensor.
2. This will allow us to load the bytes, directly on to the user provided ioBytes buffer.

This essentially mirrors the existing pattern we have for tensors, where the user is expected to follow some version of:

```
1. resolve_tensor
2. copy to target tensor
3. commit_tensor
```

Differential Revision: [D55259699](https://our.internmc.facebook.com/intern/diff/D55259699/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122700
Approved by: https://github.com/Skylion007, https://github.com/wz337, https://github.com/pradeepfn
2024-03-27 17:43:32 +00:00
cd51496f8b add a couple debug options (#121033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121033
Approved by: https://github.com/ezyang
2024-03-27 17:24:43 +00:00
5af839f86d [quant][pt2e] Enable observer sharing between different quantization specs (#122734)
Summary:

Right now we don't insert additional observers (share observers) if qspec.dtype and qspec.is_dynamic matches exactly,
since fixed qparams quantization spec and derived quantization spec do have have is_dynamic field curerntly, observer sharing does not happen between them and quantization spec, in this PR we fixed the issue by
adding is_dynamic to all quantization specs.

Note: SharedQuantizationSpec should probably be its own type in the future
TODO later:
(1). move all these fields (dtype, is_dynamic, quant_min, quant_max etc.) to QuantizationSpecBase,
(2). make SharedQuantizationSpec a separate type
(3). add quant_min/quant_max in observer sharing checking in pt2e/prepare.py

Test Plan:
python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup
Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734
Approved by: https://github.com/andrewor14
2024-03-27 16:45:19 +00:00
b63f6f78dc Revert "[Inductor] Run pattern matcher over the original graph (#122519)"
This reverts commit 1f5fcb4e203eb343e8c53f6444015c98e8f68d60.

Reverted https://github.com/pytorch/pytorch/pull/122519 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/122519#issuecomment-2023022311))
2024-03-27 15:13:26 +00:00
f3b82a4dc2 [xla hash update] update the pinned xla hash (#122628)
Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now

The old pin was ~2 weeks old?

Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444
Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628
Approved by: https://github.com/malfet, https://github.com/JackCaoG
2024-03-27 15:09:42 +00:00
f140309e9c Revert "Only update momentum buffers for SGD if momentum is enabled (#122349)"
This reverts commit a333b080c16a3a6bbb057b4fbaaec4a4e14615dd.

Reverted https://github.com/pytorch/pytorch/pull/122349 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/122349#issuecomment-2023001467))
2024-03-27 15:04:52 +00:00
70c3deef2d Revert "[xla hash update] update the pinned xla hash (#122628)"
This reverts commit 04399a30913fd04c2120420b671cd432659d56e6.

Reverted https://github.com/pytorch/pytorch/pull/122628 on behalf of https://github.com/atalman due to Need revert and then reland ([comment](https://github.com/pytorch/pytorch/pull/122628#issuecomment-2022995857))
2024-03-27 15:01:33 +00:00
eb5381da66 Skip storage check debug assert in view codegen when output is a subclass instance (#122718)
Before the fix, this assert blows up in DEBUG mode for views where the input (base) is a dense tensor and the output (view) is a subclass instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122718
Approved by: https://github.com/soulitzer
2024-03-27 14:39:51 +00:00
105381ea11 [inductor][cpp] simplify CppVecKernelChecker (remove bool/int8 load as mask and load as float flags) (#119734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119734
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #119654, #119655
2024-03-27 11:20:35 +00:00
49121603ab [inductor][cpp] support vectorized indirect indexing (#119655)
This PR adds the vectorized indirect indexing so that we can further simplify the `CppVecKernelChecker` (done in the later PR #119734) and remove the check that throws `CppVecUnsupportedError`. A boundary assertion check is added on vectorized indices and via the new `indirect_assert` method on `Kernel` - the base implementation is for scalar indices, overridden in `CppVecKernel` for vectorized indices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119655
Approved by: https://github.com/jansel
ghstack dependencies: #119654
2024-03-27 10:25:45 +00:00
a697d972b1 Fix torchbench errors (#122735)
Summary: It looks like this target has stopped working, lets fix it.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/:test
```
now works

Differential Revision: D55389546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122735
Approved by: https://github.com/xmfan
2024-03-27 06:59:16 +00:00
367ec62ae3 [inductor][cpp] generalize vector mask for dtypes (#119654)
Vectorized boolean values in CPU Inductor were modeled with `Vectorized<float>` which cannot work for operations with other data types. This PR generalizes it with the new `VecMask` template class that can work for masks on any vectorized data types. The intrinsics implementation in `cpp_prefix.h` for mask conversion, cast and masked load are now implemented as the specialization for `VecMask` and moved to corresponding header files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119654
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-27 05:33:53 +00:00
f2c1060de3 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
partitioner generates different graph in recompilation on each run
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-03-27 02:20:37 +00:00
d1104d76aa [Easy] Fix freezing bug with mismatched bias sizes (#122724)
Fix for https://github.com/pytorch/pytorch/issues/121231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122724
Approved by: https://github.com/davidberard98
2024-03-27 01:41:00 +00:00
249e65b92d Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9
2024-03-27 01:14:38 +00:00
fe41ba4765 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-27 00:51:30 +00:00
67a4d6d6cb Stopped TORCH_COMPILE_DEBUG from printing out a bunch of logs (#122688)
@ezyang suggests using TORCH_TRACE for dumping out all intermediate logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122688
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #122686
2024-03-27 00:24:40 +00:00
602c2af9e3 Cleaned up/fixed get_args after_aot repro (#122686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122686
Approved by: https://github.com/ezyang
2024-03-27 00:24:40 +00:00
c81c9ba472 Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514)
This PR:
- disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing.
- disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in
  PT2)

The motivation behind this is that the leading cause of segfaults when
using custom ops with PT2 is calling .data_ptr on FunctionalTensor or
FakeTensor.

This change is BC-breaking. If your code broke as a result of this, it's
because there was a bug in it (these .data_ptr should never be
accessed!). You can either fix the bug (recommended) or get the previous
behavior back with:
```
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor

data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr()
```

Test Plan:
- existing tests

Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler
2024-03-26 23:55:42 +00:00
04399a3091 [xla hash update] update the pinned xla hash (#122628)
Originally made this PR since xla was failing, but the PR that changed the pin got reverted, so this is just a normal update now

The old pin was ~2 weeks old?

Currently XLA is broken https://github.com/pytorch/pytorch/actions/runs/8438508272/job/23115239444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122628
Approved by: https://github.com/malfet, https://github.com/JackCaoG
2024-03-26 23:51:38 +00:00
07b618e2d4 Graph break cleanly in Dynamo for module parametrization (#121041)
Fixes #118795

This is a graph breaking partial fix for #120914. We still need -actual- module parametrization tracing support, but at least it doesn't blow up hard now.

**Background**: Module parametrization injects a property as the module parameter attribute that calls a `nn.Module` whose forward takes in a module parameter and returns a reparametrized module parameter.
Example:
```
class MyParametrization(nn.Module):
    def forward(X):
        # This reparametrization just negates the original parameter value
        return -X

m = nn.Linear(...)
p = MyParametrization()
register_parametrization(m, "weight", p)

# Accessing the "weight" attribute will invoke p's forward() on m's original weight and return the output as the new weight.
# m.weight here is now an injected property that does the above instead of an actual Parameter.
# This property is defined in torch/nn/utils/parametrize.py.
m.weight

# NB: Parametrization changes the module type (e.g. torch.nn.utils.parametrize.ParametrizedLinear)
print(type(m))
```

**Problem 1**: Dynamo has special tracing rules for things in `torch.nn`. Parametrizing a module changes the type of the module and the parametrized attribute, so now these rules wrongly affect tracing here. To fix this:
* For parametrized modules, call `convert_to_unspecialized()` to restart analysis where Dynamo starts inlining the module.

**Problem 2**: The issue seen in #118795 is that Dynamo will see a dynamically constructed tensor when `m.weight` is called and introduce that to its `tensor_weakref_to_sizes_strides` cache during fake-ification. This tensor is also made to be a graph input, since it's a module parameter. When guards are created for this module parameter input, the logic calls `m.weight` again and tries to look the result up in the cache, but this is a different tensor now, giving the `KeyError` symptom. To fix this:
* Replace Dynamo's `tensor_weakref_to_sizes_strides` cache with a `input_source_to_sizes_strides` cache.
    * This cache was originally introduced in #100128.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121041
Approved by: https://github.com/anijain2305
2024-03-26 23:44:51 +00:00
2367d0dacd [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562) (#122690)
Summary:

During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.

To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Reviewed By: zoranzhao

Differential Revision: D55354548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
2024-03-26 23:25:15 +00:00
09cb42ce29 [dynamo] delete graph_out_{n} after restoring local vars (#122658)
At graph breaks, we create a graph_out_{n} symbol to hold the graph output and
use it to restore the local vars. In addition to their own symbols, the local
vars are kept alive by the symbol we created. This means that if the graph
break is the last usage of one of the symbols, the symbol would still be kept
alive upon graph resumption.

This PR: delete the graph_out_{n} symbol after restoring local vars so the
lifetime of the local vars is governed by themselves.

## Example Problem
Tensor `b`'s last usage is in the graph break. However, it won't be deallocated until `bar()` completes. In the orignal issue report by @Yuzhen11, `b` is a large tensor and `bar()` is an expensive computation.

```python
import torch

def foo(a):
    return torch.mm(a, a)

@torch._dynamo.disable()
def graph_break_fn(a):
    ret = a.bfloat16()
    return ret

def bar(c):
    return torch.mm(c, c)

def fn(a):
    b = foo(a)
    c = graph_break_fn(b)
    # del b
    return bar(c)

fn_compiled = torch.compile(fn, backend="eager")
a = torch.randn(10000, 10000, device="cuda", requires_grad=True)

fn_compiled(a).sum().backward()
```

Bytecode before this PR:
```
ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18
 19           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               1 (b)

 20           8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                1 (b)
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (c)

 22          16 LOAD_GLOBAL              2 (bar)
             18 LOAD_FAST                2 (c)
             20 CALL_FUNCTION            1
             22 RETURN_VALUE

MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18
 18           0 LOAD_GLOBAL              3 (__compiled_fn_0)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               3 (graph_out_0)
              8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                3 (graph_out_0)
             12 LOAD_CONST               1 (0)
             14 BINARY_SUBSCR

 20          16 CALL_FUNCTION            1
             18 LOAD_GLOBAL              4 (__resume_at_14_1)
             20 ROT_TWO
             22 CALL_FUNCTION            1
             24 RETURN_VALUE

ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_FAST                0 (___stack0)
              2 JUMP_ABSOLUTE            9 (to 18)
              4 LOAD_GLOBAL              0 (foo)
              6 LOAD_FAST                1 (a)
              8 CALL_FUNCTION            1
             10 STORE_FAST               2 (b)
             12 LOAD_GLOBAL              1 (graph_break_fn)
             14 LOAD_FAST                2 (b)
             16 CALL_FUNCTION            1
        >>   18 STORE_FAST               3 (c)

 22          20 LOAD_GLOBAL              2 (bar)
             22 LOAD_FAST                3 (c)
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_GLOBAL              3 (__compiled_fn_2)
              2 LOAD_FAST                0 (___stack0)
              4 CALL_FUNCTION            1
              6 UNPACK_SEQUENCE          1
              8 RETURN_VALUE
```

Bytecode after this PR:
```
ORIGINAL BYTECODE fn /home/yifu/microbench/del2.py line 18
 19           0 LOAD_GLOBAL              0 (foo)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               1 (b)

 20           8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                1 (b)
             12 CALL_FUNCTION            1
             14 STORE_FAST               2 (c)

 22          16 LOAD_GLOBAL              2 (bar)
             18 LOAD_FAST                2 (c)
             20 CALL_FUNCTION            1
             22 RETURN_VALUE

MODIFIED BYTECODE fn /home/yifu/microbench/del2.py line 18
 18           0 LOAD_GLOBAL              3 (__compiled_fn_0)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 STORE_FAST               3 (graph_out_0)
              8 LOAD_GLOBAL              1 (graph_break_fn)
             10 LOAD_FAST                3 (graph_out_0)
             12 LOAD_CONST               1 (0)
             14 BINARY_SUBSCR
             16 DELETE_FAST              3 (graph_out_0)

 20          18 CALL_FUNCTION            1
             20 LOAD_GLOBAL              4 (__resume_at_14_1)
             22 ROT_TWO
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

ORIGINAL BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_FAST                0 (___stack0)
              2 JUMP_ABSOLUTE            9 (to 18)
              4 LOAD_GLOBAL              0 (foo)
              6 LOAD_FAST                1 (a)
              8 CALL_FUNCTION            1
             10 STORE_FAST               2 (b)
             12 LOAD_GLOBAL              1 (graph_break_fn)
             14 LOAD_FAST                2 (b)
             16 CALL_FUNCTION            1
        >>   18 STORE_FAST               3 (c)

 22          20 LOAD_GLOBAL              2 (bar)
             22 LOAD_FAST                3 (c)
             24 CALL_FUNCTION            1
             26 RETURN_VALUE

MODIFIED BYTECODE torch_dynamo_resume_in_fn_at_20 /home/yifu/microbench/del2.py line 20
 20           0 LOAD_GLOBAL              3 (__compiled_fn_2)
              2 LOAD_FAST                0 (___stack0)
              4 CALL_FUNCTION            1
              6 UNPACK_SEQUENCE          1
              8 RETURN_VALUE

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122658
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-03-26 22:49:05 +00:00
df724153c1 Add option to skip cudagraphing on dynamic shape graphs (#122520)
This was requested internally.

Differential Revision: [D55264528](https://our.internmc.facebook.com/intern/diff/D55264528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122520
Approved by: https://github.com/mlazos, https://github.com/shunting314
2024-03-26 21:49:21 +00:00
e229ec6886 [NEON] Speedup float16 convert (#122702)
By using `vcvt_f16_f32` and back

According to [benchmark_convert.py](d3279637ca) this makes float32 to float16 tensor conversion roughly 3 times faster: time to convert 4096x4096 float32 tensor drops from  5.23 msec to 1.66 msec on M2 Pro

Test plan: run `vector_test_all_types` + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122702
Approved by: https://github.com/kimishpatel
2024-03-26 21:48:12 +00:00
6767c04fde Forward fix for broken internal tests related to NJT view dummy (#122704)
(internal link) [example test breakage](https://www.internalfb.com/intern/test/562950061753019?ref_report_id=0)

Symptom: `type stub not overridden` for SymInt. The global NJT dummy relies on `SymInt.__mul__()` in its constructor. Lazily constructing the dummy avoids the race.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122704
Approved by: https://github.com/soulitzer
2024-03-26 21:22:12 +00:00
291848bf30 [Build] Fix AVX detection logic (#122708)
`CXX_AVX[2|512]_FOUND` flags should indicate whether compiler supports generating code  for given instruction set, rather than whether host machine can run the generated code.

This fixes a weird problem that surfaced after https://github.com/pytorch/pytorch/pull/122503 when builder can sometimes be dispatched to an old CPU architecture, that can not run AVX512 instructions, but can compile for those just fine

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122708
Approved by: https://github.com/jeanschmidt
2024-03-26 20:37:35 +00:00
3bede14fa7 Don't create world pg variable out of thin air when rewriting c10d collectives (#122561)
Fixes https://github.com/pytorch/pytorch/issues/122404

Previously, when rewriting c10d collectives, if the group argument is
unspecified or None, we create a world pg variable out of thin air and
pass it to the rewrite target. The approach was problematic, as it
assumes the symbol `torch` is available in the scope (see #122404).

After #120560, dynamo can now trace dist.group.WORLD. If the group
argument is unspecified, we can just set it with dist.group.WORLD in the
rewrite target.

Testing

pytest test/distributed/test_inductor_collectives.py -k test_dynamo_rewrite_dist_allreduce

Also verified with the repro provided in #122404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122561
Approved by: https://github.com/wconstab
ghstack dependencies: #120560
2024-03-26 20:12:08 +00:00
852111e1c2 [TORCH_TRACE] Record stack when no compile context is available (#122644)
This will help me track down those annoying unknown compile products.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122644
Approved by: https://github.com/jamesjwu
2024-03-26 19:30:52 +00:00
f631586084 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit b6982bf2b25d2d3ba5d82488a39721d6013a838f.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2021233604))
2024-03-26 18:54:17 +00:00
537cd66e73 [Inductor] Support custom op in JIT with cpp wrapper (#122554)
Summary:  To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python.  https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code.

Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-26 18:48:45 +00:00
e61aaab725 Log autotune time in scuba (#122637)
Summary:
This diff
* Refactors triton and autotune caches to be child classes of the original memcache based cache infra
* Swaps scuba table for autotune
* Adds autotune time spent/saved to scuba table

Test Plan:
Local testing using:
```
buck run mode/opt fbcode//caffe2/test/inductor/:max_autotune -- -r test_max_autotune_remote_caching_dynamic_False
```
and
```
TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1 buck2 run mode/opt //scripts/oulgen:runner
```

Differential Revision: D55332620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122637
Approved by: https://github.com/jamesjwu
2024-03-26 17:51:33 +00:00
1f5fcb4e20 [Inductor] Run pattern matcher over the original graph (#122519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122519
Approved by: https://github.com/jansel
2024-03-26 17:30:32 +00:00
8cfbdc0451 [Easy][DCP] Fix small typo in assert (#122633)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122633
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-03-26 16:46:12 +00:00
30a579dba3 Add XPU ATen merge rule (#122484)
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122484
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-03-26 16:20:48 +00:00
FEI
e08cbc0d41 update comment of test_invalid_last_dim_stride in test_transformers.py (#122679)
Fixes #122594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122679
Approved by: https://github.com/mikaylagawarecki
2024-03-26 15:40:24 +00:00
8bad7b63c8 [ez] Add more files to trigger inductor (#122669)
To catch https://github.com/pytorch/pytorch/pull/122562/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122669
Approved by: https://github.com/desertfire
2024-03-26 15:19:30 +00:00
9b90c5e2a1 [CI] Switch pull job linux-jammy-py3_8-gcc11-build to use ARC with runner groups (#122503)
title says it all...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122503
Approved by: https://github.com/atalman
2024-03-26 14:38:12 +00:00
85845a29db Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310
Approved by: https://github.com/masnesral, https://github.com/lezcano
2024-03-26 14:16:33 +00:00
7e176ebb47 Log compilation_metrics to TORCH_TRACE (#122638)
It's not technically needed as you can get it from Scuba too, but it's
more convenient for tlparse to get at it this way.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122638
Approved by: https://github.com/albanD
2024-03-26 14:10:55 +00:00
99c822c0ba Let dynamo inline through jacfwd (#121254)
Similar to #121146, changes are simple and don't require any fancy modification to the codebase. Moved a few entries on trace_rules.py and added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121254
Approved by: https://github.com/zou3519
ghstack dependencies: #120338
2024-03-26 12:43:30 +00:00
2b4173e0de [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-26 08:12:41 +00:00
293579363c [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-26 08:09:35 +00:00
caf9c23310 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-26 08:07:06 +00:00
41d24df08f [export] hack skip index_put_ in dce (#122683)
Summary: Ideally we should do whats in the todo. Just doing this for now to unblock llama capture

Test Plan: capturing llama and using pt2e to quantize it

Differential Revision: D55354487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122683
Approved by: https://github.com/kimishpatel
2024-03-26 08:05:06 +00:00
e0329cba8a [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-26 08:03:42 +00:00
b7089937dc Disable test (test_mm_plus_mm2_cuda_cuda_wrapper) (#122682)
Summary:
The test is unstable at the moment. We need to make sure both Aten
and Triton Kernel works to reactivate the test.

Test Plan:
Disabling test

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122682
Approved by: https://github.com/clee2000
2024-03-26 07:14:35 +00:00
f8eeae7aaa Enable CPP wrapper codegen registration (#121296)
Extend code gen registration for `CppWrapper`. W/ this PR, an new backend can register its specific `CppWrapper` at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-26 06:51:03 +00:00
d1f58eaaf5 [inductor] Fix bug with freezing + split_cat passes (#122544)
Fixes #122380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122544
Approved by: https://github.com/eellison
2024-03-26 06:12:57 +00:00
268b0cc714 Do not run CUDA lazy init if it is triggered with fake mode on. (#122636)
Partially fixes https://github.com/pytorch/pytorch/issues/122109

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122636
Approved by: https://github.com/zou3519
2024-03-26 05:43:59 +00:00
dd3f2cb53a [Inductor] Add NEON ISA support on arm64 Macs (#122217)
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions)

Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608

Following was added/changed to enable vectorization code to work on MacOS
 - Added VecNEON class to `_inductor/codecache.py`  that is supported on all AppleSilicon Macs
 - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
 - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)

See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype  | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16  | 120 tokens/sec  | 130 tokens/sec | 156 tokens/sec |
| float32  | 158 tokens/sec  | 140 tokens/sec | 236 tokens/sec |
| float16  | 235 tokens/sec  | 81 tokens/sec | 58 tokens/sec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
2024-03-26 05:07:30 +00:00
a333b080c1 Only update momentum buffers for SGD if momentum is enabled (#122349)
As title

[benchmark](https://gist.github.com/mlazos/1171f035a2392c33778aaa3d7bf24370)

Helps compiled vanilla SGD execution time by 2x on certain models with large number of small params (ex.
ElectraForQuestionAnswering goes from 1090us -> 554us)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122349
Approved by: https://github.com/janeyx99
2024-03-26 04:19:39 +00:00
0c47f8028e Keep example_inputs when saving and loading ExportedProgram (#122618)
Summary:
`torch.export` is a powerful tool for creating a structured and shareable package from arbitrary pytorch code. One great use case of `torch.export` is sharing models or subgraphs in a way that allows results to be easily replicated. However, in the current implementation of `export`, the `example_inputs` field is thrown out. When trying to replicate bugs, benchmarks, or behaviors, losing the original input shapes and values makes the process much messier.

This change adds saving and loading for the `example_inputs` attribute of an `ExportedProgram` when using `torch.export.save` and `torch.export.load`. This simple addition makes `ExportedPrograms`s a fantastic tool for performance and accuracy replication. For example, with this change we enable the following workflow:

```
# Script to create a reproducible accuracy issue with my model.
kwargs = {"fastmath_mode": True}
exp_program = export(my_model, sample_inputs, kwargs)
result = exp_program.module()(*sample_inputs, **kwargs)
# Uhoh, I dont like that result, lets send the module to a colleague to take a look.
torch.export.save(exp_program, "my_model.pt2")
```

My colleague can then easily reproduce my results llike so:

```
# Script to load and reproduce results from a saved ExportedProgram.
loaded_program = torch.export.load("my_model.pt2")
# The following line is enabled by this Diff, we pull out the arguments
# and options that caused the issue.
args, kwargs = loaded_program.example_inputs
reproduced_result = loaded_program.module()(*args, **kwargs)
# Oh I see what happened here, lets fix it.
```

Being able to share exact inputs and arguments makes `ExportedPrograms` much
more clean and powerful with little downside. The main potential issue with this change
is that it does slightly increase the size of saved programs. However, the size of
inputs will be much smaller than parameters in most cases. I am curious to hear
discussion on saved file size though.

The deserialization of `example_inputs` is currently implemented as `Optional`. Although this wont effect users of `export.save` and `export.load`, it does give backwards compatibility to any direct users of `serialize` and `deserialize`.

Test Plan:
This diff includes a new test which exercises the save / load flow with multiple args and kwargs.

```
buck test //caffe2/test:test_export -- TestSerialize
```

Differential Revision: D55294614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122618
Approved by: https://github.com/zhxchen17
2024-03-26 03:32:44 +00:00
47e8d60627 [dtensor] add op support for view_as_complex and view_as_real (#122569)
This PR will unblock DTensor computations for [rotary embeddings](https://github.com/meta-llama/llama/blob/main/llama/model.py#L132) used in LLaMa training.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122569
Approved by: https://github.com/wanchaol
ghstack dependencies: #122541
2024-03-26 03:32:04 +00:00
1af6fc5e03 Remove top-level DisableFuncTorch; clearing interpreter stack should work. (#122610)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122610
Approved by: https://github.com/zou3519
ghstack dependencies: #122202
2024-03-26 03:08:22 +00:00
f42818321b Restore DILL_AVAILABLE for backwards compat with torchdata (#122616)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122616
Approved by: https://github.com/peterbell10
2024-03-26 02:18:51 +00:00
55f36d1ada Revert "[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)"
This reverts commit 57a3d00b0659e4ac37c4a35a36c71f710e89197a.

Reverted https://github.com/pytorch/pytorch/pull/122562 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122562#issuecomment-2019262415))
2024-03-26 02:18:19 +00:00
4e0b5d59fa [dtensor] add backward support for scaled dot product attention (flash-attention) (#122541)
As titled, as a followup to the forward part #120298.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122541
Approved by: https://github.com/wanchaol
2024-03-26 01:50:24 +00:00
83ad8e01b1 fix the problem that cpu_fallback for aten::triu_indices on custom device crashed (#121306)
Fixes #121289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121306
Approved by: https://github.com/ezyang
2024-03-26 01:29:45 +00:00
5e66bf5f42 Avoid COW materialize in nn.functional forward ops (3) (#122443)
Affected ops:
* repeat
* unfold
* logsigmoid
* pixel_shuffle/unshuffle
* remaining norm ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122443
Approved by: https://github.com/ezyang
2024-03-26 00:56:57 +00:00
b6982bf2b2 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-26 00:52:12 +00:00
eda279c997 [CpuInductor] Implement masked_load for integral types (#122608)
Use `if constexpr` to separate float vs integral masked load for avx512
Discovered while looking at `test_comprehensive_fft_ihfft2_cpu_int64` on
non-AVX512 capable CPUs where (5, 6, 7) shape were big enough to start a vectorized loop

Added `test_pad_cast` regression test

Fixes https://github.com/pytorch/pytorch/issues/122606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122608
Approved by: https://github.com/jansel
ghstack dependencies: #122607
2024-03-25 22:44:54 +00:00
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are being generated internally.
Those constants are neither parameters or buffers, and users have zero control on them.

To accomodate this, we should allow users not passing in those constants generated internally but still be able the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
ebde6c72cb Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
2024-03-25 21:33:36 +00:00
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Repetition on squashed & merged by mistake https://github.com/pytorch/pytorch/pull/122406

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
1f67da5105 [executorch hash update] update the pinned executorch hash (#122152)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122152
Approved by: https://github.com/pytorchbot
2024-03-25 20:56:34 +00:00
46a76cfef5 [ROCm] Fix test_trace_rule_update.py (#121524)
-Add missing torch API to trace rules and ignore API with manual trace rule.

The PR fix test/dynamo/test_trace_rule_update

maybe related to https://github.com/pytorch/pytorch/pull/121142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121524
Approved by: https://github.com/jansel, https://github.com/pruthvistony
2024-03-25 20:53:24 +00:00
bc7f3859b3 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
2024-03-25 20:50:12 +00:00
1c1268b6e9 seg-fault of "basic_string::_M_construct null not valid" fix for getNcclErrorDetailStr (#121905)
When working on testing all-reduce with an alternative rccl replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` return null, and then the code uses the return value to do std::string would seg-fault with an exception of `basic_string::_M_construct null not valid`.

This pull request is to fix this edge condition so that it will exit the program gracefully with useful information.

**Test:**
Before the fix, my test script exits like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```

After this fix, my test script exited with useful message like,
```
[rank0]:   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]:  Last error: Unknown NCCL Error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
2024-03-25 20:49:34 +00:00
05bbcae5bb Refactor functorch meta conversion (#122202)
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.

The primarily complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer.

I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.

The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
2024-03-25 20:47:21 +00:00
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
b2c496ba24 Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)"
This reverts commit c1fe09dc37358d8121f119d66e9e8c8d57035158.

Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))
2024-03-25 20:14:40 +00:00
f84e3bf36d [ez] Fix XLA auto hash updates (#122630)
The xla pin is located in .github/ci_commit_pins not .ci/docker/ci_commit_pins
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122630
Approved by: https://github.com/huydhn
2024-03-25 19:45:56 +00:00
9d1de31634 [BE][CPUInductor] Use C++17 helper templates (#122607)
Such as `std::is_same_v` ,`std::is_integral_v` and C++14 one `std::enable_if_t`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122607
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-25 19:01:44 +00:00
2d4197c9b7 add case for creating storage on ort (#122446)
Fixes #122445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122446
Approved by: https://github.com/mikaylagawarecki
2024-03-25 18:59:20 +00:00
2db7d874a9 [inductor] Improve error message for shape errors in slice_scatter (#122543)
Fixes #122291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122543
Approved by: https://github.com/shunting314
2024-03-25 18:57:16 +00:00
db506762d1 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit a52b4e22571507abc35c2d47de138497190d2e0a.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))
2024-03-25 18:52:05 +00:00
c7bf5871ce CUDAEvent::elapsed_time could accidentally initialize a non-used GPU (#122538)
This sets the device before call cudaEventElapsedTime to avoid the case
where the "cudaGetCurrentDevice" device would be initialized even though
neither event is on that device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122538
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-03-25 17:49:50 +00:00
198927170d Avoid COW materialize in nn.functional forward ops (2) (#121992)
Affected ops:
* dropout
* embedding
* embedding_bag
* mutli_head_attention_forward
* grid_sample
* ctc_loss
* nll_loss
* pdist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121992
Approved by: https://github.com/ezyang
ghstack dependencies: #122437, #121991
2024-03-25 17:31:19 +00:00
55becf02bc Avoid COW materialize in nn.functional forward ops (1) (#121991)
Affected ops:
* Remaining norm ops
* pad
* margin_loss ops
* fractional_max_pool
* linear
* prelu
* rrelu
* scaled_dot_product_attention
* logsigmoid
* threshold
* binary_cross_entropy
* gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121991
Approved by: https://github.com/ezyang
ghstack dependencies: #122437
2024-03-25 17:31:19 +00:00
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS-14.14 MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them by using `at::view_as_real` trick

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
e6a37eeb06 run some cuda testcases on other devices if available. (#122182)
If users want to run some cuda testcases on other devices throw setting an environment variable for testing the performance on custom devices, I think it can be used like this pr.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182
Approved by: https://github.com/ezyang
2024-03-25 16:40:03 +00:00
70ac13b876 [ez][TD] Hide errors in llm retrieval job (#122615)
The new ghstack does have a base on main anymore, so finding the base for ghstacked PRs is harder.  Something similar to https://github.com/pytorch/pytorch/pull/122214 might be needed, but then I'm worried about tokens

Either way, this is a quick workaround to hide these errors for ghstack users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122615
Approved by: https://github.com/huydhn
2024-03-25 16:35:00 +00:00
47a9725de9 Implement prefer_deferred_runtime_asserts_over_guards (#122090)
Fixes https://github.com/pytorch/pytorch/issues/121749

As promised, it is pretty easy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090
Approved by: https://github.com/lezcano
2024-03-25 16:31:16 +00:00
e49a38973f Update DimOrDims typing in torch.sparse (#122471)
I noticed the typing of the `torch.sparse.sum`'s `dim` parameter wasn't allowing an int tuple as input and tracked the issue to this type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122471
Approved by: https://github.com/soulitzer
2024-03-25 16:25:56 +00:00
06f22537ca [dynamo] Suppress warning about torch.autograd.Function() (#122566)
PR #120577 got reverted due to issues in fbcode.  This hides warning
that PR was trying to fix until we can debug the fbcode issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122566
Approved by: https://github.com/yanboliang
2024-03-25 16:18:43 +00:00
0465a90b00 [export][reland] Fix unflattened submodule ordering. (#122341) (#122507)
Summary:

Make sure the order of submodules is the same as the original eager module.

bypass-github-export-checks

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering

Differential Revision: D55251277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507
Approved by: https://github.com/tugsbayasgalan
2024-03-25 15:22:01 +00:00
11dfa72153 [BE] Remove unnecessary state dict update. (#122528)
From what I can see, following is a redundant/unnecessary setting of dict element.

Differential Revision: [D55191396](https://our.internmc.facebook.com/intern/diff/D55191396/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122528
Approved by: https://github.com/Skylion007
2024-03-25 15:21:44 +00:00
5152945441 GPT2 SDPA inference pattern-matching for Inductor-CPU (#121866)
### Summary
With this PR, SDPA pattern of GPT2 is being mapped to `torch.nn.functional.scaled_dot_product_attention`.
While GPT2 supports both a causal mask & an attention mask, this PR considers the case of attention mask being absent.
TorchBench inference workload for GPT2 also doesn't use an attention-mask.

This pattern's replacement is being disabled for CUDA because [CUDA AOT Inductor](https://github.com/pytorch/pytorch/actions/runs/8319111885/job/22762567770) CI job's `GPT2ForSequenceClassification` accuracy test failed, although all other trunk CUDA Inductor CI checks had passed.
Created #122429 to track that particular issue.

### CPU performance data with TorchBench
|MODEL |BATCH SIZE | DTYPE | BEFORE: Speedup over eager-mode with the default Inductor implementation | AFTER: Speedup over eager mode with SDPA op mapped| Perf boost = (AFTER - BEFORE)/BEFORE * 100|
|--------------------------|-------------|---------|-----------------------------|--------------------------|------------|
|hf_GPT2| 1 | FP32 | 1.522x | 1.791x| 17.67%|
|hf_GPT2| 1 | BF16 (AMP) | 1.795x | 2.387x| 32.98%|
|hf_GPT2| 2 | FP32 |  1.313x |1.629x | 19.3%|
|hf_GPT2|2| BF16 (AMP) | 1.556x | 1.924x | 23.65%|
|hf_GPT2_large| 1 | FP32 | 1.380x |1.585x | 12.93%|
|hf_GPT2_large| 1 | BF16 (AMP) | 1.208x | 1.567x | 22.91%|
|hf_GPT2_large| 2 | FP32 | 1.188x | 1.490x | 25.42%|
|hf_GPT2_large|2| BF16 (AMP) | 0.991x | 1.575x | 58.93%|

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen Sapphire Rapids)
48 physical cores were used. Intel OpenMP & libtcmalloc were preloaded.

Example command -
```
 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 -C 0-47 python benchmarks/dynamo/torchbench.py --performance --inference --inductor --float32 -dcpu --only hf_GPT2_large --freezing --batch-size 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121866
Approved by: https://github.com/Valentine233, https://github.com/jgong5, https://github.com/desertfire
2024-03-25 15:04:03 +00:00
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba92884be6432cf65a777b8ed708e3d6.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
cyy
b9d6f8cc18 Fix clang-tidy warnings in aten/src/ATen/core/*.cpp (#122572)
This PR fixes clang-tidy warnings in aten/src/ATen/core/*.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122572
Approved by: https://github.com/ezyang
2024-03-25 13:46:24 +00:00
1e404c9b12 Remove redundant query to tensor_to_context (#122278)
from_real_tensor will query it again, so this query is strictly
dominated.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270, #122271
2024-03-25 13:16:21 +00:00
49b81af45f Delete dead memoized_only kwarg in FakeTensor (#122271)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270
2024-03-25 13:16:21 +00:00
f32ce4e28e Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270)
It's annoying grepping for `__call__` call-sites so they're now all explicit now. I'd do this to MetaConverter too but that one is way more public, a lot more sites.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270
Approved by: https://github.com/eellison
ghstack dependencies: #122044
2024-03-25 13:16:13 +00:00
069270db60 [dynamo] Fix list comparison ops (#122559)
Fixes #122376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559
Approved by: https://github.com/Skylion007
2024-03-25 07:03:23 +00:00
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what was the idea behind `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626  ) , but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if data type is 64-bit value such as int64)
Discovered while trying to enable arch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
deeeaded1f Add metas for randint/rand factory functions out overload (#122375)
Fixes https://github.com/pytorch/pytorch/issues/121897

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375
Approved by: https://github.com/lezcano
2024-03-25 04:01:38 +00:00
cyy
a01d35c7f6 [TorchGen] Remove unused variables (#122576)
This PR removes some unused Python variables from TorchGen scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576
Approved by: https://github.com/Skylion007
2024-03-25 03:31:41 +00:00
e75ecd5618 [BE][veclib] Use is_same_v/enable_if_t (#122533)
`enable_if_t` helper is part of C++14
`is_same_v` helper is part of C++17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122533
Approved by: https://github.com/Skylion007
2024-03-24 20:57:41 +00:00
14e348b7ad Handle JIT test failure when the GPU is newer than the CUDA compiler or vice versa (#122400)
The test may fail because it either uses target flags newer than the GPU resulting in failures loading the compiled binary or targetting a GPU for which CUDA has no support yet/anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
2024-03-24 13:58:06 +00:00
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
3e4a4bea12 [dynamo] Graph break on SymNode control flow (#122546)
Fixes #111918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122546
Approved by: https://github.com/ezyang
2024-03-24 07:22:02 +00:00
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
cyy
c1fe09dc37 [TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)
This PR is a follow-up of #120076, it moves std::optional<Generator> detection logic into  ```valuetype_type``` of api/cpp.py by adding the mutable parameter, which facilitates future value type changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415
Approved by: https://github.com/ezyang
2024-03-24 06:11:08 +00:00
ca9606f809 Update COW OpInfo test to include kwargs and expected materialization (#122437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122437
Approved by: https://github.com/ezyang
2024-03-24 06:07:30 +00:00
9d4218c23e Handle JIT test failure when the GPU is newer than the CUDA compiler (#122402)
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'"
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
ultimately failing the test.

This checks for this case and allows execution to continue

Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
2024-03-24 05:36:24 +00:00
cyy
808a035658 [Dynamo][4/N] Enable clang-tidy coverage on torch/csrc/dynamo/* (#122534)
This PR enables clang-tidy coverage on torch/csrc/dynamo/* and also contains other small improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122534
Approved by: https://github.com/Skylion007
2024-03-24 05:26:32 +00:00
f0d461beac [vision hash update] update the pinned vision hash (#122536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122536
Approved by: https://github.com/pytorchbot
2024-03-24 03:42:21 +00:00
5f7e71c411 [dynamo] Add HASATTR guard for UserDefinedObject attrs (#122555)
Fixes #111522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122555
Approved by: https://github.com/Skylion007
2024-03-24 03:41:58 +00:00
07d037674f [inductor] Fix issue with randint + symbolic shapes (#122428)
Fixes #122405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122428
Approved by: https://github.com/ezyang
2024-03-24 03:41:13 +00:00
476585b190 Preserve unbacked SymInt on SymNode (#120816)
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).

This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated.  If we replace it, we no longer can
find out where, e.g., u1 was allocated!  But we need to know this
so we can generate deferred runtime asserts like u1 == s0.

To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:

* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.

There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.

Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.

Fixes https://github.com/pytorch/pytorch/issues/119689

Fixes https://github.com/pytorch/pytorch/issues/118385

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
2024-03-24 02:56:16 +00:00
cyy
a52b4e2257 Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-24 02:12:08 +00:00
788638fcdc Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122473
Approved by: https://github.com/lezcano
2024-03-24 01:02:20 +00:00
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp nodes fusion with compatible ranges with an assumption that all scheduler nodes inside the fused nodes are the same, however, it appeared that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
4758837930 [BE] Do not use importlib.load_module (#122542)
To get rid of the annoying
```
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
```
using recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-23 17:22:26 +00:00
bf40e3f880 [EZ][BE] Add missing acosh op to vec256_float_neon.h (#122513)
As base class has it
ed15370aab/aten/src/ATen/cpu/vec/vec_base.h (L367-L369)

Discovered while attempting to enabling Inductor vectorization on ARM platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122513
Approved by: https://github.com/Skylion007
2024-03-23 14:18:02 +00:00
a39e638707 Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506)
As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85 % depending on the used BSR tensor block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506
Approved by: https://github.com/cpuhrsch
2024-03-23 11:54:33 +00:00
8a209344c9 Fix access to unitialized memory in VSX vector functions for quantized values (#122399)
Similar to https://github.com/pytorch/pytorch/pull/89833 those function may access uninitialized memory leading
to undefined behavior/results.
Initialize with zeros as done before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122399
Approved by: https://github.com/ezyang
2024-03-23 06:11:30 +00:00
c677221798 remove torchao dependency (#122524)
Test Plan:
CI

```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2
```

```
buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4
```

Differential Revision: D55263008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524
Approved by: https://github.com/jerryzh168
2024-03-23 03:18:43 +00:00
19d27a13ea [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-03-23 01:45:07 +00:00
4d8a3f8bb3 changed aliasing checks to properly recurse for computing last usage (#122444)
Fixes https://github.com/pytorch/pytorch/issues/122457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444
Approved by: https://github.com/yifuwang, https://github.com/jansel
ghstack dependencies: #121624, #122474
2024-03-23 01:43:21 +00:00
50036ec781 [Inductor] Add a test for creating a cpu inductor-> triton backend (#122396)
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However there is no such test for the Triton backend; it should be possible for a user to create and register your own ExtensionWrapperCodegen and ExtensionSchedulingfor another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.

Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.

This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.

Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
2024-03-23 01:14:57 +00:00
41d69ff324 Add a shape inference tool (#120097)
Summary:
Add a shape inference tool that helps to infer each node shape of a given graph module.
1. Given a fx graph, and an example of an input(don't need to be an accurate input that can be forward, but should have valid dims and data structures), `infer shape` creates an input of symbolic shape
2. Shape prop this symbolic input can catch runtime or value exceptions.
3. These errors are constraints for symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol.
4. Finally, we run the shape propagation based on input tensor to get tensor shapes for all nodes in the FX traced module.

Test Plan:
### 1. Test `infer symbol values`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```

### 2. Test `infer shape`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
Inferred shape result like: P897560514

Differential Revision: D53593702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097
Approved by: https://github.com/yf225
2024-03-23 00:23:24 +00:00
29bca8547b Fix failing test_cpu_repro without vectorization support (#117262)
At least the following tests fail when there is no supported vector ISA:
test_lowp_fp_neg_abs
test_non_contiguous_index_with_constant_stride
test_scalar_mul_bfloat16
test_transpose_non_contiguous
test_transpose_sum2d_cpu_only
test_transpose_sum_outer
test_transpose_vertical_sum_cpu_only
test_vertical_sum_cpu_only

Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero
which is never the case without a supported vector ISA, e.g. on PPC and
maybe on AArch.

Skip those tests with a new decorator and use the simpler one where an equivalent is already used

Some usages of `metrics.generated_cpp_vec_kernel_count` where guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to where that was previously done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-23 00:03:55 +00:00
a84f1d3def [effects] Fix backwards handling (#122346)
I didn't previously test the `.backwards()` call, and when testing on #122348 I realized we were missing some token handling in some places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122346
Approved by: https://github.com/zou3519
2024-03-22 23:31:52 +00:00
e7fa3f7812 AOTDispatch: allow subclasses to correct when we guess metadata of tangents incorrectly (#118670)
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.

More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:

"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"

Here, the problem is similar:

"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".

This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).

One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:

(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error

(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:

`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.

`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.

`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.

I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses is definitely a contentious change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
2024-03-22 23:16:08 +00:00
f7b8d8e249 Support for sapling scm (#122072)
We can use Sapling (hg) with the pytorch repo but there are a couple minor issues to teach our scripting to be happier with having either a git or hg repo.

This change fixes some issues in:
- setup.py
- lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072
Approved by: https://github.com/ezyang
2024-03-22 22:59:16 +00:00
cyy
482f6c4693 [Dynamo][3/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122392)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122392
Approved by: https://github.com/ezyang
2024-03-22 22:57:41 +00:00
3f99306452 [export] Remove from_export flag (#122500)
Summary: The flag from_export was incorrectly included in a previous diff (https://www.internalfb.com/diff/D54314379) - it was intended for helping with ExportedProgram verification, but was no longer needed in the final implementation.

Test Plan: Changes no functionality, test/export already covers everything

Differential Revision: D55205857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122500
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-22 22:55:14 +00:00
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
271cc687de Audit retracibility errors and fix some ez ones (#122461)
Summary: Title

Test Plan: CI

Differential Revision: D55227094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122461
Approved by: https://github.com/zhxchen17
2024-03-22 21:31:51 +00:00
29132c2e47 Prevent dup initializers when ONNXProgram.save is called many times (#122435)
Fixes https://github.com/pytorch/pytorch/issues/122351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122435
Approved by: https://github.com/titaiwangms
ghstack dependencies: #122196, #122230
2024-03-22 21:03:15 +00:00
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
3795ebe925 Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490)"
This reverts commit 6bbd697306851b785b51b4d0545c1ef9365ddaa6.

Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
97d3bf71b9 Revert "[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)"
This reverts commit 700c92e1b9cb6fae2610d08e5a960273c4dd1697.

Reverted https://github.com/pytorch/pytorch/pull/121491 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
8013c4409f [inductor] config to control whether we assume inputs are aligned (#122158)
**Motivation**: https://github.com/pytorch/pytorch/issues/112771

**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This an can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
2024-03-22 20:03:38 +00:00
5790096059 [dynamo] Remove uses of raise unimplemented (#122136)
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
2024-03-22 19:29:58 +00:00
ed15370aab [aoti] Add handling of ir.Constants in promote_constants (#122419)
This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225)

On the following model:
```
class M(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):
        t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float)
        t = torch.sqrt(t * 3)
        return x * t
```

We get the following error:
```
======================================================================
ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test
    return value(self)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant
    self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),))
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model
    actual = AOTIRunnerUtil.run(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run
    so_path = AOTIRunnerUtil.compile(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile
    so_path = torch._inductor.aot_compile_ep(
  File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep
    return compile_fx_aot(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot
    compiled_lib_path = compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx
    return inference_compiler(unlifted_gm, example_inputs_)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base
    return inner_compile(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile
    graph.run(*example_inputs)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run
    return super().run(*args)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node
    result = super().run_node(n)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul
    return make_pointwise(fn)(a, b)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner
    inputs = promote_constants(inputs, override_return_dtype)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._inductor.exc.LoweringException: StopIteration:
  target: aten.mul.Tensor
  args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0))
  args[1]: 3
```

So I added an additional casing in `promote_constants` to handle the ir.Constants and now it works! Although please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419
Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78
2024-03-22 18:39:36 +00:00
cyy
52e9049ffa Remove unused variables (#122496)
This PR removes several unused variables in the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122496
Approved by: https://github.com/ezyang
2024-03-22 18:04:09 +00:00
bbe846f430 Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Start to fix https://github.com/pytorch/pytorch/issues/114801

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828
Approved by: https://github.com/thiagocrepaldi
2024-03-22 18:01:33 +00:00
34d33df056 [DCP] Check if pg exists in async before checking for cpu PG (#122316)
Check if pg exists in async before checking for cpu PG in async save path.

This PR enables using async_save even if PG is not initialized.

Differential Revision: [D54868689](https://our.internmc.facebook.com/intern/diff/D54868689/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54868689/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122316
Approved by: https://github.com/shuqiangzhang, https://github.com/XilunWu
2024-03-22 18:01:11 +00:00
400cc518fc pt2 dper passes: run shape prop before each pass (#122451)
Summary: Most passes relies on shape info. We need to run shape prop after each pass

Reviewed By: frank-wei

Differential Revision: D55221119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122451
Approved by: https://github.com/frank-wei
2024-03-22 17:57:25 +00:00
152fa9ecc2 skip moondream for training (#122483)
The model shows as failed model on the dashboard for training. But the model is not implemented for training (at least for now):
2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)

Skip it in dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483
Approved by: https://github.com/eellison
2024-03-22 17:35:52 +00:00
a3d4eaf253 [inductor] device guard for max autotune benchmark (#122479)
Internal users reported that they get failure for max-autotune if tensors are not on device 0. It turns out that we may use tensors on device say 6 and run kernel on them at device 0.

This PR enforces that we do benchmarking for max-autotune on the correct device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
2024-03-22 17:27:53 +00:00
3db64c1955 [NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049)
Differential Revision: D54993977

### Summary
The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_.
Move the code of filling ncclCommDevIdxMap out of if (useTensorRegisterAllocatorHook_) statement.

### Test Plan
See diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
2024-03-22 17:10:33 +00:00
f305c96cac [DCP] Add bytesIO object to test_e2e_save_and_load (#122112)
Added a TestTrainstate that includes BytesIO checkpoint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122112
Approved by: https://github.com/LucasLLC
2024-03-22 16:57:13 +00:00
86082f1fdc [aot_inductor] added runtime checks for input/output tensors in debug compile mode (#122047)
This PR added runtime checks to guard the dtypes and shapes of input/output tensors.
Currently, we enable these only for debug compilation
(i.e. aot_inductor.debug_compile is True) in abi_compatible mode.

Differential Revision: [D54993148](https://our.internmc.facebook.com/intern/diff/D54993148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122047
Approved by: https://github.com/desertfire
2024-03-22 16:40:33 +00:00
90a13c3c5b Added a check in register_lowering to avoid decomposed ops (#117632)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117632
Approved by: https://github.com/lezcano
2024-03-22 16:38:31 +00:00
9347a79f1c [Watchdog Timer] Clear timer for already terminated process (#122324)
Summary:
handling cases where worker process is terminated w/o releasing the timer request, this scenario causes reaping of process at expiry.

removing the non-existent process during clear timer.

Test Plan: unit tests

Differential Revision: D55099773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
2024-03-22 15:48:03 +00:00
018f5e2c32 Fix unused variable warning in int4mm.cu (#122286)
Fix the following warning while compilation:
```
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
  871 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
 1044 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286
Approved by: https://github.com/soulitzer
2024-03-22 15:46:18 +00:00
7fd14ebb52 [export] Use randomized inputs to examples. (#122424)
Summary: as title. replacing all torch.ones to randn

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D55206441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122424
Approved by: https://github.com/tugsbayasgalan
2024-03-22 15:32:28 +00:00
60bc29aa0b Revert "[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)"
This reverts commit 2c6eeb26d3f61fba352ad51fd8653120937a20f3.

Reverted https://github.com/pytorch/pytorch/pull/122267 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
b30b396d05 Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)"
This reverts commit 99f0fec7d0873d627e8c7f2dec65818d725424b0.

Reverted https://github.com/pytorch/pytorch/pull/122268 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
777ac511cc Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)"
This reverts commit 783fd89ff1cf401e484c20d14b16823abf20d87d.

Reverted https://github.com/pytorch/pytorch/pull/122373 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
dbedc6bb7c Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)"
This reverts commit 23a6d74f9352e0afb37750fee300d077c4ba9393.

Reverted https://github.com/pytorch/pytorch/pull/122374 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
02fee6caec Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit ecbe82b9cec75324b7efb58e1d9cae6b35b71bdc.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))
2024-03-22 14:53:45 +00:00
e6986e4317 Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
2024-03-22 14:48:22 +00:00
65c37fe05a AOTAutograd: ensure traced tangent subclass metadata takes non-contiguous outputs into account (#118669)
Fixes https://github.com/pytorch/pytorch/issues/118596.

The issue was as follows:

(1) Whenever AOTAutograd sees an output that is non-contiguous, that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing

(2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime.

(3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231

(4) However, this metadata was **wrong** - it was generated by inspecting the tangent, **before** we made the tangent contiguous.

The fix in this PR basically moves the logic make `traced_tangents` contiguous earlier, at the time that we first generate `ViewAndMutationMetadata`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669
Approved by: https://github.com/zou3519
ghstack dependencies: #118803, #119947
2024-03-22 14:42:27 +00:00
09be5800c8 dynamo: support placement kwargs for DTensor.to_local() (#119947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119947
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #118803
2024-03-22 14:42:27 +00:00
2e44b12dd4 dynamo: handle DTensor.device_mesh.device_type (#118803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118803
Approved by: https://github.com/wanchaol, https://github.com/yanboliang
2024-03-22 14:42:22 +00:00
ea8e0c75c7 [quant][pt2] Fix create FQ with FixedQParamsQSpec (#122104)
Summary: Before we just returned a _PartialWrapper object when
using FixedQParamsQuantizationSpec in QAT. This is wrong and
we should return a FQ object instead.

Differential Revision: [D55021106](https://our.internmc.facebook.com/intern/diff/D55021106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122104
Approved by: https://github.com/jerryzh168
2024-03-22 14:23:05 +00:00
6e6891e843 [jit] Fix _batch_norm_with_update shape function (#122430)
Summary: We used `native_batch_norm`'s shape function before,
but the schemas are actually different. We need to create new
shape functions for `_batch_norm_with_update` specifically.

Test Plan:
buck2 test '@fbcode//mode/opt-tsan' fbcode//caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - TestShapeGraphLinting.Basic'

Reviewers: bdhirsh, davidberard98, eellison

Differential Revision: [D55211182](https://our.internmc.facebook.com/intern/diff/D55211182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122430
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2024-03-22 14:21:57 +00:00
23a6d74f93 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-22 13:13:14 +00:00
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
700c92e1b9 [Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**

 * During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**

Test plan:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
2024-03-22 10:58:43 +00:00
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
eca30df846 Added load_args to repro (#121624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121624
Approved by: https://github.com/ezyang
2024-03-22 08:32:14 +00:00
783fd89ff1 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-22 08:17:57 +00:00
99f0fec7d0 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-22 08:15:28 +00:00
bb75313f0a [dynamo] Optimize handling of BINARY_OP (#122465)
This saves ~0.1s on https://dev-discuss.pytorch.org/t/a-torchdynamo-trace-time-ablation-study/1961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122465
Approved by: https://github.com/oulgen
2024-03-22 08:14:58 +00:00
2c6eeb26d3 [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-22 08:12:23 +00:00
6bbd697306 [Inductor] Make codecache CUDA compilation more robust & flexible (#121490)
Minor changes which make the CUDA compilation within _inductor/codecache.py
more robust and flexible.

Test plan:
CI
Additional test in test_codecache.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490
Approved by: https://github.com/jansel
2024-03-22 08:12:11 +00:00
a337ee0a3a [Quant] Enable QConv2d with silu post op (#122266)
**Summary**
Enable QConv2d implementation with post op `silu`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_silu_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122266
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-03-22 07:58:45 +00:00
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
6ba85cfc2a Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439)
Fixes the memory leak reported in #122417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439
Approved by: https://github.com/soulitzer
2024-03-22 06:44:12 +00:00
3600778ede Do not create a new node if no normalization is needed (#122330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122330
Approved by: https://github.com/jansel
2024-03-22 05:51:28 +00:00
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
cyy
ecbe82b9ce Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-22 03:49:31 +00:00
ef0d470eb3 [vision hash update] update the pinned vision hash (#122453)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122453
Approved by: https://github.com/pytorchbot
2024-03-22 03:37:11 +00:00
fb57d1699b [export] Fix handling output in remove_effect_tokens_pass (#122357)
Added handling for updating the output_spec in the graph signature if the the result of a with_effects call is an output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122357
Approved by: https://github.com/zhxchen17
2024-03-22 03:35:59 +00:00
09eb07bee8 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add limited ATen operators implementation for XPU. With this PR, the blocking issue for oneDNN operations and Inductor XPU backend will be resolved as the two components depend on these operations to support its basic features, respectively.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, , `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-22 03:31:04 +00:00
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
5e0440edb4 Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87.  Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))
2024-03-22 02:18:28 +00:00
470b44c048 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
2024-03-22 02:12:37 +00:00
cd6bfc7965 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This ops is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-22 02:12:36 +00:00
bde22835c6 [PT2] - Guard oblivious on meta registrations (#122216)
Summary:
```
[trainer0|0]:Potential framework code culprit (scroll up for full backtrace):
[trainer0|0]:  File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check
[trainer0|0]:    if index.numel() != 0:
```

Test Plan: General CI.

Reviewed By: ezyang

Differential Revision: D54689183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216
Approved by: https://github.com/ezyang
2024-03-22 01:36:03 +00:00
4f93b3d958 [Dort] Reduce excessive warning to info (#122442)
No need to warn when an op can be exported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122442
Approved by: https://github.com/thiagocrepaldi
2024-03-22 01:09:33 +00:00
a001b4b048 Inductor: Don't clamp views when the views come from split_with_sizes (#122149)
Summary:
Fixes #122126

`split_with_sizes` don't need clamping.

Test Plan: Added test + CI

Differential Revision: D55043320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122149
Approved by: https://github.com/ezyang
2024-03-22 00:55:36 +00:00
b1fa0ce4aa [export] build the infra to rollout predispatch export. (#122326)
Test Plan:
fbcode:caffe2/test/quantization:test_quantization
fbcode:bolt/nn/executorch/backends/tests:qnn_test
fbcode:on_device_ai/helios/compiler_tests/...
fbcode:pyspeech/tests:pyspeech_utils_test_oss
fbcode:caffe2/test:quantization_pt2e_qat
fbcode:on_device_ai/Assistant/Jarvis/tests:test_custom_ops
fbcode:modai/test:test_modai
fbcode:executorch/exir/backend/test:test_partitioner

Differential Revision: D55133846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122326
Approved by: https://github.com/tugsbayasgalan
2024-03-22 00:55:10 +00:00
4b535906aa Better handle test-config labels on PR (#122155)
I have some minor fixes in the scripts to

1. Fix the bug where the empty test matrix was confusingly print as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588
1. Replace `print` with `logging.info`
1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list.  It's out of date and not many people use this features besides `test-config/default`, so why bother.  The behavior here is simpler now:
    1. If the PR has some `test-config/*` labels, they will be applied
    1. If the PR has none of them, all test configs are applied
1. Add log for the previous 2 cases to avoid confusion

### Testing

```
python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \
  --pr-number 121065 \
  --test-matrix "{ include: [
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
  ]}
 ```

Also running on this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155
Approved by: https://github.com/clee2000
2024-03-21 23:20:52 +00:00
bce640709c Revert "Precompile triton templates (#121998)"
This reverts commit b8df2f0ca530ebe01fa079c891c170a1f4b22823.

Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail b8df2f0ca5 ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))
2024-03-21 23:05:59 +00:00
c4486d3e88 Allow fake models to run with ONNXProgram.__call__ (#122230)
In order to a fake model to run using ONNXProgram.__call__
interface, we need to save the model into disk along with external data
before executing the model. This is what this PR implements

An alternative is to ONNXProgram.__call__ to detect that the model
was exported with fake mode and explicit raise an exception when
ONNXProgram.__call__ is executed. The exception message would instruct
the user to call ONNXProgram.save and manually execute the model using
the ONNX runtime of choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230
Approved by: https://github.com/BowenBao
ghstack dependencies: #122196
2024-03-21 22:28:05 +00:00
4ba51bb2c4 Add keys used for templated attention impls (#122423)
# Summary

Mypy will complain that these attributes dont exist for this PR: https://github.com/pytorch/pytorch/pull/121845/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423
Approved by: https://github.com/bdhirsh
2024-03-21 22:16:53 +00:00
224beecee6 Revert "Proper view support for jagged layout NestedTensor (#113279)"
This reverts commit 5855c490f09a028bfdfefea8b93c9833eb55dc5c.

Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))
2024-03-21 22:03:01 +00:00
12e7602cf9 Revert "Support for torch.nested.as_nested_tensor(t) (#113280)"
This reverts commit 17c9c7026521be1c194cae278b76ac8e8f7d145b.

Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))
2024-03-21 22:00:44 +00:00
816db3bd29 Revert "Public API for NJT construction from jagged components (#121518)"
This reverts commit d4dff9cf5e7b734a8621b571e8f5a761dc43e1e0.

Reverted https://github.com/pytorch/pytorch/pull/121518 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/121518#issuecomment-2013879641))
2024-03-21 21:56:29 +00:00
48afb5c325 [inductor] Use python constants in IndexPropagation (#122031)
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.

Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
2024-03-21 21:53:22 +00:00
99055ae165 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left is as TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78, https://github.com/bdhirsh
2024-03-21 21:51:32 +00:00
332456c44d triton_kernel_wrap shouldn't access FakeTensor.data_ptr (#122418)
The comment suggests that we need to replace all FakeTensors with real
tensors. `torch.empty` doesn't actually return a real Tensor because
FakeTensorMode is active!

We disable torch dispatch so that torch.empty actually returns a real Tensor.

The motivation for this PR is that we're trying to ban
FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the
next PR up in the stack

Test Plan:
- Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418
Approved by: https://github.com/oulgen
2024-03-21 21:48:07 +00:00
621fdc9db8 infer_schema can add alias annotations when passed a list of mutated args (#122343)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122343
Approved by: https://github.com/ezyang
ghstack dependencies: #122319, #122320
2024-03-21 21:39:07 +00:00
639d6201b4 Expand the types infer_schema can infer (#122320)
This PR allows it to infer:
- None return as ()
- List[Tensor] as Tensor[]

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122320
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #122319
2024-03-21 21:39:07 +00:00
0dd78f1828 Add standalone tests for infer_schema (#122319)
We're gonna reuse this helper in the new python custom ops API. Given a
function with type annotations, `infer_schema(fun)` returns an inferred
schema.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122319
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2024-03-21 21:39:04 +00:00
23524710e6 [dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756)
Fixes remaining refleaks found when debugging https://github.com/pytorch/pytorch/issues/119607, tests added in https://github.com/pytorch/pytorch/pull/120657.

Also fixes some tests that xfail: https://github.com/pytorch/pytorch/issues/120631 (not entirely sure why), but introduced tests now fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120756
Approved by: https://github.com/jansel
2024-03-21 21:23:12 +00:00
2cd0a5d516 [Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808)
There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none,
where the return value of a call to V.graph._shape_env._maybe_evaluate_static
is being discarded.

This fix changes that to work how it was likely intended to.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov
2024-03-21 21:15:32 +00:00
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31d032a2659dc1633dc3e5248921d466.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
13afbcfc85 Revert "Support gpu trace on XPU (#121795)"
This reverts commit 91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2.

Reverted https://github.com/pytorch/pytorch/pull/121795 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:16 +00:00
182bb0f2ca Revert "Introduce XPU implementation for PyTorch ATen operators (#120891)"
This reverts commit 148a8de6397be6e4b4ca1508b03b82d117bfb03c.

Reverted https://github.com/pytorch/pytorch/pull/120891 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to resolve a conflict in trunk https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013434523.  Please help reland the change after ([comment](https://github.com/pytorch/pytorch/pull/120891#issuecomment-2013668563))
2024-03-21 20:30:20 +00:00
628dcde136 [AOTI] Disable stack allocation when there is a fallback op (#122367)
Summary: Stack allocation is disabled when there is an aten fallback op, see c84f81b395/torch/_inductor/codegen/cpp_wrapper_cpu.py (L974). But we need to do the same where is a custom op fallback.

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D55149369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122367
Approved by: https://github.com/mikekgfb
2024-03-21 20:02:33 +00:00
af9b71c82f fix typo in while_loop_test (#122416)
As titiled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122416
Approved by: https://github.com/angelayi
2024-03-21 19:42:08 +00:00
d131cbc44f Fuse the input -> p2p buffer copy into one-shot all-reduce kernel when the input is small (#121213)
This improves the gpt-fast llama2 70B 8xH100 (non-standard) TP benchmark from 86 tok/s to 88 tok/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121213
Approved by: https://github.com/Chillee
2024-03-21 18:25:57 +00:00
765c3fc138 fix breaking changes for ONNX Runtime Training (#122000)
Fixes breaking changes for ONNX Runtime Training.

PR https://github.com/pytorch/pytorch/pull/121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
2024-03-21 18:10:22 +00:00
c2651a7f0e Make check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
Partially fixes https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122372
Approved by: https://github.com/lezcano
ghstack dependencies: #122370
2024-03-21 17:14:42 +00:00
780f70b728 Make expected stride test in torch._prims_common size oblivious (#122370)
Partially addresses https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122370
Approved by: https://github.com/lezcano
2024-03-21 17:14:42 +00:00
25bf5f7e61 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit aa74a8b9e5b34eaa700a64064818adc7a12942ca.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to Sorry for revert your change one more time but the hard part is that it breaks lot of internal builds ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2013043364))
2024-03-21 17:07:17 +00:00
b8df2f0ca5 Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
2024-03-21 17:04:53 +00:00
17175cdbc7 [Docs] Add extended debugging options for troubleshooting (#122028)
Fixes #120889

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-21 17:00:45 +00:00
c20bc18d59 [export] allow static constraints in dynamic_shapes (#121860)
This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example:

```
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        ...

    foo = Foo()
    inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3))

for dynamic_shapes in [
    None
    ((4, 6), (5, 4), (3, 3)),
    ((None, 6), None, {0: 3, 1: 3})
]:
    _ = export(foo, inputs, dynamic_shapes=dynamic_shapes)
```

All of the above should produce the same ExportedProgram.

This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints.

Additionally, export() will also fail if the shapes are mis-specified, for example:
```
_ = export(foo, inputs, dynamic_shapes=((5, None), None, None))
```
leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860
Approved by: https://github.com/avikchaudhuri
2024-03-21 16:59:59 +00:00
16935de961 Support alias for NestedTensorCPU/CUDA (#117711)
Fixes #ISSUE_NUMBER

Co-authored-by: Vincent Moens <vmoens@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117711
Approved by: https://github.com/albanD
2024-03-21 16:05:52 +00:00
148a8de639 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add limited ATen operators implementation for XPU. With this PR, the blocking issue for oneDNN operations and Inductor XPU backend will be resolved as the two components depend on these operations to support its basic features, respectively.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, , `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-21 15:42:20 +00:00
204fd69ca6 Make ONNXProgram.model_proto and disk file the same (#122196)
Currently, the in-memory onnx program model proto does
not contain initializers saved into the disk version.

This PR changes this behavior, so that both versions are
identical. This is important for running models with fake
tensor from OMMProgram.model_proto directly, without a file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196
Approved by: https://github.com/BowenBao
2024-03-21 15:29:31 +00:00
f9996ed764 [BE] Enable torch inductor tests running on MacOS (#122360)
Original idea was limit the testing to just x86 Macs, but right now it will be skipped on all Apple Silicon ones, as all of them have Metal capable GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122360
Approved by: https://github.com/jansel
2024-03-21 14:47:05 +00:00
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicate, to match the `torch.cond` [sematics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
0b68a28c87 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-21 11:53:31 +00:00
0d8e960f74 Revert "[Sparsity] add support for H100 compute capability 9.x (#121768)"
This reverts commit 91fdaa1b416ab8ac8be30f3c3428751e236657cd.

Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))
2024-03-21 10:42:08 +00:00
cyy
7f8bb1de83 [Dynamo][2/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122362)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122362
Approved by: https://github.com/ezyang
2024-03-21 09:41:41 +00:00
ea1cd31b50 [c10d] Log the target of FR dump (#122345)
Summary: It would be useful to log the destination of the trace dump in either manifold or local file for the users to quickly locate the dump

Test Plan: Modified unit tests

Differential Revision: D54972069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
2024-03-21 08:03:05 +00:00
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
7fa1be506b Add an option to sdpa benchmark to specify backend (#122368)
# Summary
Adds the ability to specify sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368
Approved by: https://github.com/cpuhrsch
2024-03-21 07:00:40 +00:00
18c164ef7c [Inductor] Match insignficiant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
b915877deb Support numpy array in Tensor.__eq__ (#122249)
When the `other` arg of `Tensor.__eq__` is a numpy array, it is converted to a PyTorch tensor view of the numpy array, which is then given as the `other` arg to a `Tensor.eq` call

Fixes #119965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122249
Approved by: https://github.com/ezyang
2024-03-21 04:55:01 +00:00
bf18e967b4 [c10d] disable compute_duration by default (#122138)
Summary:
Compute duration would invoke additional cuda overhead and possibly
GPU mem increase and possible hang, so we want to disable it by default and enable it only
when needed, or at least when timing is enabled.

Test Plan:
Test with existing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
2024-03-21 04:45:37 +00:00
ea6f67853e [inductor fbcode] Add python include paths for Python.h (#122363)
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.

bypass-github-export-checks

Test Plan: I can only get this to show up in Bento: N5106134

Reviewed By: hl475, chenyang78

Differential Revision: D55133110

Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
2024-03-21 04:32:17 +00:00
d4dff9cf5e Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
2024-03-21 04:14:17 +00:00
17c9c70265 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-03-21 04:13:55 +00:00
77bed8f7f2 [ONNX] model_type flag is only supported under SKIP_XFAIL_SUBTESTS (#122336)
Fixes #120918

To address the confusion that developers usually have on which list to put xfail and skip. This PR provides guidance that `model_type` and `matcher` specified xfail/skip should go to `SKIP_XFAIL_SUBTESTS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122336
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-21 04:10:32 +00:00
cc0cadaf4c [vision hash update] update the pinned vision hash (#122154)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122154
Approved by: https://github.com/pytorchbot
2024-03-21 03:59:12 +00:00
61f69c7fc4 [audio hash update] update the pinned audio hash (#122153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122153
Approved by: https://github.com/pytorchbot
2024-03-21 03:53:24 +00:00
885fb9742d Handle special kwargs in user-written Triton kernel calls (#122280)
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
2024-03-21 03:34:07 +00:00
3e6fdea390 [ONNX] Fix list dtype finding bug in dispatcher (#122327)
Fixes #122166

Before this PR, dispatcher assumes the first input should provide the reasonable dtype to them, but `aten::index` reveals the cases with `None` in the front of inputs. The PR addresses it by selecting the first non None input to take dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122327
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-03-21 02:54:58 +00:00
ae913175c3 Fix GraphModuleDeserializer (#122342)
Summary: self.constants is used in self.deserialize_signature()

Test Plan: CI

Differential Revision: D55152971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122342
Approved by: https://github.com/zhxchen17
2024-03-21 02:27:39 +00:00
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
91ead3eae4 Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on XPU backend. Add GPU trace to xpu runtime. It is beneficial to generalize the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-21 01:56:42 +00:00
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
57734202c6 [HSTU][TGIF] Provide a API to check whether running in torch_dispatch mode (#122339)
Summary: We provide a `is_in_torch_dispatch_mode` API returning `bool` to determine whether the program is running in torch dispatch mode or not.

Test Plan:
- OSS CI
- Tested with publish of hstu models with the this diff and following diffs D54964288, D54964702, D54969677, D55025489, runtime errors are not raised anymore in publish

Differential Revision: D55091453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339
Approved by: https://github.com/jiayisuse
2024-03-21 01:37:23 +00:00
e38d60bc07 Remove some stale xla dynamo backend (#122128)
`torchxla_trace_once ` and `aot_torchxla_trivial ` should be removed.

In our internal(hopefully dashboard can be open source soon) torchbench daily runs, `openxla` backend has much higher passing rate and similar perfomrance as the `openxla_eval`(non-aot-auto-grad backend). We still use `openxla_eval` in llama2 example but I think we should move user to `openxla` backend going forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128
Approved by: https://github.com/alanwaketan, https://github.com/jansel
2024-03-21 01:13:50 +00:00
c20cf97366 Move some cudagraphs checks into C++ (#122251)
Based off of https://github.com/pytorch/pytorch/pull/111094
This + cpp guards improves TIMM geomean optimizer performance by about 20%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251
Approved by: https://github.com/eellison
2024-03-21 01:02:23 +00:00
be5863de39 Remove usage of deprecated volatile (#122231)
Summary:
When building our iOS app, we get a compile error about the deprecated `volatile` keyword.

This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet

Test Plan: Successfully built the iOS app that previously had a compile error

Differential Revision: D55090518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231
Approved by: https://github.com/malfet
2024-03-21 00:55:16 +00:00
1686e2d1e4 [symbolic shapes][compile-time] Minor compile time optimization in has_free_symbols (#122144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122144
Approved by: https://github.com/lezcano
ghstack dependencies: #120726
2024-03-21 00:48:57 +00:00
cyy
c2eedb7f8a [Dynamo][1/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122259)
This PR begins a series of works to ensure dynamo C++ code is clang-tidy clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122259
Approved by: https://github.com/ezyang
2024-03-21 00:43:25 +00:00
c80601f35a Revert "Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)"
This reverts commit a2a88f39ee991f471f2a2c54571886d70f5cd2e6.

Reverted https://github.com/pytorch/pytorch/pull/121537 on behalf of https://github.com/kurtamohler due to flaky CI failures ([comment](https://github.com/pytorch/pytorch/pull/121537#issuecomment-2010937226))
2024-03-21 00:03:30 +00:00
eqy
d5b5012dc4 [CUDA] Raise softmax_forward_64bit_indexing GPU memory requirement (#116075)
printing `torch.cuda.memory_summary()` shows ~41GiB reserved at the end of this test, not sure how it was passing previously on CUDA.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075
Approved by: https://github.com/ptrblck, https://github.com/malfet
2024-03-21 00:03:17 +00:00
5855c490f0 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This ops is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-20 23:45:34 +00:00
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying vector norm formula for 1-dimensional vector.
- Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix.
- Additionally, avoids overflow in power, `abs(x) ** p` for large `p` or `x`, for 1-dimensional vector.

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

|                          	|                 	|         	|         	| **Avg Latency (ms)**  	|                       	|                                        	|
|--------------------------	|-----------------	|---------	|---------	|-----------------------	|-----------------------	|----------------------------------------	|
| **op**                   	| **input shape** 	| **dim** 	| **ord** 	| **baseline (master)** 	| **optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a)** 	| **speedup ratio (baseline/optimized)** 	|
| torch.norm               	| (2**18, 1)      	| -1      	| fro     	| 34.3755531            	| 0.0125408             	| 2741.094                               	|
|                          	|                 	|         	| inf     	| 34.0952635            	| 0.0122237             	| 2789.271                               	|
|                          	|                 	|         	| -inf    	| 34.3674493            	| 0.0120759             	| 2845.953                               	|
|                          	|                 	|         	| 0       	| 34.1004515            	| 0.0175261             	| 1945.69                                	|
|                          	|                 	|         	| 1       	| 34.1688442            	| 0.0121593             	| 2810.089                               	|
|                          	|                 	|         	| -1      	| 33.949492             	| 0.0120282             	| 2822.487                               	|
|                          	|                 	|         	| 2       	| 34.3669581            	| 0.0120401             	| 2854.366                               	|
|                          	|                 	|         	| -2      	| 33.9252067            	| 0.0121069             	| 2802.139                               	|
|                          	|                 	|         	|         	|                       	|                       	|                                        	|
| torch.linalg.vector_norm 	| (2**18, 1)      	| -1      	| inf     	| 34.090879             	| 0.0095105             	| 3584.545                               	|
|                          	|                 	|         	| -inf    	| 34.3708754            	| 0.0099111             	| 3467.931                               	|
|                          	|                 	|         	| 0       	| 34.0880775            	| 0.0141716             	| 2405.38                                	|
|                          	|                 	|         	| 1       	| 34.1392851            	| 0.0093174             	| 3664.036                               	|
|                          	|                 	|         	| -1      	| 33.925395             	| 0.0092483             	| 3668.302                               	|
|                          	|                 	|         	| 2       	| 34.3854165            	| 0.0092459             	| 3719.002                               	|
|                          	|                 	|         	| -2      	| 33.932972             	| 0.0093007             	| 3648.429                               	|

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
aa74a8b9e5 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type defination gap between Windows and Linux.
2. Fix some operator not support on Windows, such as [], /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixed build issue on Windows.
6. Fixed bazel build issues.
7. Fix test app not link to sleef on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-20 22:41:13 +00:00
666d6291af Cast checkpoint weights to match model parameter's dtype (#122100)
Fixes #121986
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122100
Approved by: https://github.com/BowenBao
2024-03-20 22:01:40 +00:00
2289fa5f5a [while_loop] fix mode not on stack error (#122323)
Fixes https://github.com/pytorch/pytorch/issues/121453.

This is caused by missing  `with mode` in FakeTensor mode.

Test Plan:
add new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122323
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #122244
2024-03-20 21:17:33 +00:00
512251c8f3 Use tree_map to get device ids and device types for activation checkpointing (#121462)
`get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs results in silent incorrect results as `get_device_states` returns an empty result and no rng is saved as a result here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty.

Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462
Approved by: https://github.com/soulitzer
2024-03-20 21:09:21 +00:00
cyy
1dd1899fd6 Add missing throw of std::runtime_error in dynamo/guards.cpp (#122306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122306
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 20:50:01 +00:00
d2a8d3864c [PT2][Inductor] Change the log for the group batch fusion (#122245)
Summary: Instead of using "batch_fusion" and "group_fusion" to log, we use the specific pass name to log, which could better summarize the hit of each pattern as well as debug

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```

Differential Revision: D55103303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
2024-03-20 20:45:37 +00:00
61ff41f0ca [while_loop] disable closure capturing and manually set the inputs. (#122244)
For while_loop operator, it's important to keep the output ordering consistent with input ordering. Previously, we're using set_graph_inputs="automatic", which doesn't respect such ordering. This PR changes it to "manual" and respects the original user inputs' ordering. We disable closures for body and cond fn as they require some additional designs. This PR is just to prevent the bleeding.

 Repro:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop
from torch._functorch.aot_autograd import aot_export_module

class Nested(torch.nn.Module):
    def forward(self, ci, cj, a, b):
        def cond_fn(i1, j1, x1, y1):
            return i1 > 0
        def body_fn(i1, j1, x1, y1):
            def cond_fn_nested(i2, j2, x2, y2):
                return j2 > 0
            def body_fn_nested(i2, j2, x2, y2):
                return i2.clone(), j2 - 1, x2 + 3.14, y2 - 2.71
            i1, j1, x1, y1 = while_loop(
                cond_fn_nested, body_fn_nested, [i1, j1, x1, y1]
            )
            return i1 - 1, j1.clone(), x1 * 2, y1 / 2
        return while_loop(cond_fn, body_fn, (ci, cj, a, b))

nested = Nested()
torch.compile(nested, backend="eager", fullgraph=True)(torch.tensor(2), torch.tensor(2), torch.randn(2, 2), torch.randn(2, 2))
```

Test plan:
add new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122244
Approved by: https://github.com/aakhundov
2024-03-20 20:14:35 +00:00
2f6e8e84c5 Fix _chunk_cat.out issue (#122076)
# PR
Vectors allocated inside `get_chunk_cat_metadata()` are out of local scope when used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning vectors from `get_chunk_cat_metadata`.
This PR also added a few unit tests to cover more edge cases.

# Tests
This PR is tested with the following command and no error shows. So the flaky test error should be resolved.

- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32`
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500`

Fixes #122026
Fixes #121950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076
Approved by: https://github.com/yifuwang
2024-03-20 20:01:38 +00:00
c84f81b395 [export] add pass to remove auto functionalized hop (#122246)
Summary: Adds a pass that blindly removes the functionalize hop without consideration on if its safe. Useful for ExecuTorch today and other usecases that have additional logic that can reason about when this pass is safe to use

Test Plan: added unit test

Differential Revision: D55103867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
2024-03-20 19:31:52 +00:00
d813474363 [Pytorch] auto format _python_dispatch file (#122226)
Summary: Auto format the _python_dispatch file, to make D55091453 easier to review

Test Plan: `arc lint`

Differential Revision: D55091454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226
Approved by: https://github.com/aakhundov
2024-03-20 19:28:39 +00:00
821ad56ea6 [CI] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old style `runs-on` definition accept a string, new style `runs-on` requires a object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH actions yaml are not templatable and support minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-20 19:06:10 +00:00
91fdaa1b41 [Sparsity] add support for H100 compute capability 9.x (#121768)
Summary: as title

Test Plan: buck test mode/opt //caffe2/test/...

Differential Revision: D54792168

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768
Approved by: https://github.com/SherlockNoMad
2024-03-20 19:00:54 +00:00
d1e8b97387 [export] Log module hierarchy. (#121970)
Summary:
We can also log the module hierarchy in the following format:
```
:ToplevelModule
sparse:SparshArch
dense:DenseArch
```
So that we can have more information recorded about model's identity.

Test Plan: CI

Differential Revision: D54921097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970
Approved by: https://github.com/angelayi
2024-03-20 18:59:42 +00:00
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b635187316c6c856c5fe6b6a28d8a15a.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
1d13c82559 Precompile in background (#121997)
Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we only only wait only those choices in the scheduler to allow multiple separate lowerings to compile in parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275
2024-03-20 18:34:12 +00:00
65eb22158e Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit afc4c9382ff8b55da848ef40b4a17a92fb3d2ad6.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/huydhn due to Broke dynamo tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2010276712))
2024-03-20 18:04:53 +00:00
072935917b Update cuda_to_hip_mappings.py (#122110)
Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110
Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007
2024-03-20 17:17:53 +00:00
334f7e43f9 [TD] Remove credentials requirement for retrieval (#122279)
Made the bucket readable by public
https://s3.console.aws.amazon.com/s3/buckets/target-determinator-assets?region=us-east-1&bucketType=general&tab=permissions

The only jobs that matter here are the retrieval and td jobs, which were both successful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122279
Approved by: https://github.com/huydhn
2024-03-20 15:55:46 +00:00
2e02e1efad Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Fixes https://github.com/pytorch/pytorch/issues/122127

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-20 14:44:55 +00:00
15a8185cd3 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit 2b060983809e5fe8706acd085fff67b6a27bfb5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/zou3519 due to This caused build failures for 2+ pytorch devs, so we're reverting it to be safe ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2009661069))
2024-03-20 14:10:12 +00:00
06db0a9f78 Revert "Upgrade submodule sleef to fix build warning (#122168)"
This reverts commit eec8b252b70b2489aee7281d336eb9c32dd85483.

Reverted https://github.com/pytorch/pytorch/pull/122168 on behalf of https://github.com/zou3519 due to trying to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/122168#issuecomment-2009653474))
2024-03-20 14:05:58 +00:00
8a94005d46 [dynamo][runtime_asserts] Ignore failures on sorting sympy relations (#122205)
Differential Revision: [D55075500](https://our.internmc.facebook.com/intern/diff/D55075500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122205
Approved by: https://github.com/ezyang
2024-03-20 13:25:37 +00:00
afc4c9382f Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-20 13:09:19 +00:00
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
eb1d6ed9f9 [Inductor] fix addmm fusion check (#121953)
Fixes #121253.

To avoid functional issue, disable pattern match for `addmm` when `beta!=1 or 0` or `alpha!=1`, as either `mkl_linear` or `mkldnn_linear` doesn't accept `beta` or `alpha` as parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121953
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-20 09:22:51 +00:00
ee6ce31b1d [BE][fix] fix test_tp_random_state and add it to periodic test list (#122248)
fix #122184 . Add the test to periodic test so that we can capture the error at CI in future.

**Test**:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122248
Approved by: https://github.com/wanchaol
2024-03-20 08:24:14 +00:00
a1d02b423c XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263)
This starts to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263
Approved by: https://github.com/jansel
2024-03-20 08:03:34 +00:00
477d154ffd [dynamo] Add missing _nonvar_fields annotations (#122219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122219
Approved by: https://github.com/anijain2305
ghstack dependencies: #122218
2024-03-20 07:53:18 +00:00
46bf37b3f7 [dynamo] Replace VariableTracker.apply with visit/realize_all (#122218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122218
Approved by: https://github.com/anijain2305
2024-03-20 07:53:18 +00:00
a0db2e4237 [dynamo] Fixed handling of ImportError (#122222)
Fixes #122088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122222
Approved by: https://github.com/anijain2305
2024-03-20 07:52:01 +00:00
7832efb242 [export] skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)
Downstream users of torch.export may have different module classes (e.g. LoweredBackendModule), which cannot be checked for metadata in the same way. Add lines to skip this for non-fx.GraphModule modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122210
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-20 07:40:48 +00:00
7d2b2dec4b [Pytoch][Vulkan] Register run_conv1d_context (#122172)
Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff.

Test Plan:
```
[luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (208 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (81 ms)
[----------] 2 tests from VulkanAPITest (289 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (289 ms total)
[  PASSED  ] 2 tests.
```

full test result
```
...
[----------] 427 tests from VulkanAPITest (22583 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (22583 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

Differential Revision: D55052816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172
Approved by: https://github.com/nathanaelsee
2024-03-20 07:36:23 +00:00
e7141d117f [IntraNodeComm] refactor rendezvous into a separate method for better code organization and error handling (#120968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120968
Approved by: https://github.com/wanchaol
2024-03-20 06:54:25 +00:00
cyy
9f572b99a6 [Clang-tidy header][29/N] Enable clang-tidy warnings in aten/src/ATen/core/*.h (#122190)
This PR enables clang-tidy in `aten/src/ATen/core/*.h`, which ends the series of patches beginning from #122015.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122190
Approved by: https://github.com/Skylion007
2024-03-20 06:17:37 +00:00
11e64b4ba8 [dtensor] aten.cat to use stack strategy approach (#122209)
This PR switch aten.cat to use the strategy approach that is similar to
aten.stack, as these two ops share similar semantics

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209
Approved by: https://github.com/wz337
2024-03-20 04:19:25 +00:00
5b7ceab650 Support auto_functionalize in pre-dispatch (#122177)
Summary: Title

Test Plan: CI

Differential Revision: D55042061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177
Approved by: https://github.com/zou3519
2024-03-20 04:17:58 +00:00
dc89d8b74a Fix broken lint after #116876 (#122253)
Trivial fixes, so let's do this instead of reverting the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122253
Approved by: https://github.com/clee2000
2024-03-20 04:09:00 +00:00
de950039fc Use .get in xml parsing (#122103)
Check that the `classname` attribute actually exists.
#122017
I expect this route to happen very rarely

At a certain point, we should just remove this parsing altogether since everything uses pytest now...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103
Approved by: https://github.com/huydhn
2024-03-20 04:07:49 +00:00
6662627c89 Add APIs for custom device using TensorIteratorBase. (#120792)
1) add operand and get_dim_names API;
2) set will_resize to true when output tensor is undefined;
3) add abs_stub for dummy device and calculate on cpu device;
4) support dummy device copy with stride;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792
Approved by: https://github.com/ezyang
2024-03-20 03:51:09 +00:00
f8565c4a28 [sigmoid] Clean up serialization API. (#122102)
Summary: Entirely remove the old serializer code to avoid further confusion and code bloat.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D54857118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122102
Approved by: https://github.com/tugsbayasgalan
2024-03-20 03:45:36 +00:00
1f8177dedf [Inductor][CPU] fix flash attention last_stride!=1 issue (#122083)
Fixes #121174.

Conv converts the input of sdpa to channel last, resulting in accuracy issue. Ensure the layout in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122083
Approved by: https://github.com/eellison, https://github.com/jgong5
2024-03-20 02:22:33 +00:00
cyy
55310e58a9 Use constexpr for index variables (#122178)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122178
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 02:20:17 +00:00
eec8b252b7 Upgrade submodule sleef to fix build warning (#122168)
Subsequent PR to https://github.com/pytorch/pytorch/pull/118980, fix sleef build warning.

submodule sleef, include this sleef PR: https://github.com/shibatch/sleef/pull/514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122168
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-03-20 02:14:56 +00:00
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
e5e0685f61 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit 88ebdbc97c103271766203df6662240e95a09b42.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the distributed failure looks legit as it is also failing in trunk 88ebdbc97c ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2008483316))
2024-03-20 01:12:24 +00:00
19d6004b97 add int8 woq mm pattern matcher (#120985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120985
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/eellison
2024-03-20 00:23:41 +00:00
6fefc52a2b Set py3.x build-environment name consistently (#122247)
https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != *py3.8*`, but some build environment uses a different style with `py3_8` instead causing numpy 2.x to be installed there wrongly, i.e. 03b987fe3f
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247
Approved by: https://github.com/malfet
2024-03-19 23:56:38 +00:00
6c659bbc36 [codemod][lowrisk] Remove unused exception parameter from caffe2/c10/mobile/CPUCachingAllocator.cpp (#116875)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: kimishpatel, palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875
Approved by: https://github.com/Skylion007
2024-03-19 23:52:09 +00:00
6b95dc8884 [codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/frontend/lexer.cpp (#116876)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876
Approved by: https://github.com/Skylion007
2024-03-19 23:51:26 +00:00
d0153ca755 use make_storage_impl to create storages for COWStorage. (#121896)
Thanks to https://github.com/pytorch/pytorch/pull/118459, `make_storage_impl` will use the func ,which register for other backends, to create StorageImpl.

`make_storage_impl` completely overwrites the `make_intrusive<StorageImpl>`, so it makes sense to replace  `make_intrusive<StorageImpl>` with `make_storage_impl` to create storage in cow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121896
Approved by: https://github.com/ezyang
2024-03-19 23:40:15 +00:00
4aaf25bc38 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
cast_outputs function is only used for CPU device, and this function already called in cpu_xxx_vec, like cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-19 23:37:06 +00:00
2980779d0b [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/tt_pad_op.h (#120177)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177
Approved by: https://github.com/Skylion007
2024-03-19 23:36:52 +00:00
2239b55cd1 Add some more sanity asserts to checkPoolLiveAllocations (#122223)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122223
Approved by: https://github.com/eellison
2024-03-19 23:26:19 +00:00
139647d317 Fix #83241: torch.nn.TripletMarginLoss allowed margin less or equal to 0 (#121978)
Documentation states that the parameter margin of torch.nn.TripletMarginLoss is greater than 0, however any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss which had the same problem. Added error test input for the new ValueError.

Fixes #83241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978
Approved by: https://github.com/mikaylagawarecki
2024-03-19 23:19:11 +00:00
a843bbdb21 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/graphmatcher.cc (#118116)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: malfet, dmm-fb

Differential Revision: D52981072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116
Approved by: https://github.com/Skylion007
2024-03-19 22:45:43 +00:00
f05af9e377 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/ast.h (#120176)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176
Approved by: https://github.com/Skylion007
2024-03-19 22:44:51 +00:00
03b987fe3f [CI] Test that NumPy-2.X builds are backward compatible with 1.X (#122157)
By compiling PyTorch against 2.x RC, but running all the tests with Numpy-1.X

This has no affects on binary builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157
Approved by: https://github.com/atalman
2024-03-19 22:40:26 +00:00
f8becb626f [codemod] Remove unused variables in caffe2/caffe2/contrib/fakelowp/spatial_batch_norm_fp16_fake_op.h (#120178)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178
Approved by: https://github.com/Skylion007
2024-03-19 22:36:38 +00:00
94eb940a02 [codemod] Remove unused variables in caffe2/caffe2/operators/softmax_op_cudnn.cc (#121995)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D54931224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
2024-03-19 22:35:58 +00:00
a6aa3afa77 [codemod] Remove unused variables in caffe2/caffe2/video/video_decoder.cc (#122151)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D54378401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151
Approved by: https://github.com/Skylion007
2024-03-19 22:34:17 +00:00
a80c60ad8f [codemod] Remove unused variables in caffe2/caffe2/operators/conv_op_cudnn.cc (#122161)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161
Approved by: https://github.com/Skylion007
2024-03-19 22:33:19 +00:00
02f436da6d [codemod][bugfix] Fix addressing bug in caffe2/caffe2/video/video_input_op.h (#121856)
Summary:
# Diff Specific

The signature of `copyFrom` is
```
void Tensor::CopyFrom(const Tensor& src, bool async) {
```
so the `&context` always evaluated to true.

I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written.

# General

A bug in this code was identified by `-Waddress`, which we are working to enable globally.

This diff fixes the bug. There are a few types of fixes it might employ:

The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison.

The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here!

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856
Approved by: https://github.com/Skylion007
2024-03-19 22:28:06 +00:00
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fail the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2. We fail to get fp64 reference results for the accuracy check. In max-autotune mode numerical may change a bit and cause the cosine similarity check fail. Using fp64 baseline is more reliable and make the test pass.

The reason why we are not using a fp64 baseline earlier is because torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast model and inputs to fp64. pytree can not look into a dataclass. My fix is to convert the dataclass to namedtuple to be more pytree friendly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
c71554b944 Revert "[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)"
This reverts commit 206da97b8b61f51041f67de68e68e9a1875589ab.

Reverted https://github.com/pytorch/pytorch/pull/122052 on behalf of https://github.com/huydhn due to Although this look fixed on OSS, it is still failing internally.  I have added the reproducible buck command in the diff D55046262 ([comment](https://github.com/pytorch/pytorch/pull/122052#issuecomment-2008253185))
2024-03-19 22:22:12 +00:00
7678be4667 Replace numel with sym_numel in is_int_or_symint (#122145)
Fixes https://github.com/pytorch/pytorch/issues/122124

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122145
Approved by: https://github.com/Skylion007
2024-03-19 21:58:43 +00:00
6915a5be70 Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As title

Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```

Also benchmarked in N5106027
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```

Differential Revision: D55030372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang
2024-03-19 21:55:34 +00:00
b12d297b44 [AARCH64] Hide FP16 scalar arithmetic behind proper feature flag (#122204)
On Apple Silicon:
```
% sysctl machdep.cpu.brand_string; clang -dM -E - < /dev/null|grep __ARM_FEATURE_FP16
machdep.cpu.brand_string: Apple M1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
On Graviton2 with respective `-march` flag:
```
# ./cpuinfo/build/cpu-info |grep Microarch -A1; gcc -dM -E - -march=armv8.2-a+fp16 </dev/null | grep __ARM_FEATURE_FP16
Microarchitectures:
	8x Neoverse N1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
Test Plan: CI

Reviewed By: dimitribouche

Differential Revision: D55033347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122204
Approved by: https://github.com/huydhn
2024-03-19 21:18:09 +00:00
901ba2be86 [quant][pt2e] Add support for conv transpose + bn + {relu} weights fusion in PTQ (#122046)
Summary:

also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_tranpsose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu

conv_transpose + bn weights fusion is performed automatically and can not be disabled currently
we can add support to allow disable this fusion later if needed

Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
2024-03-19 21:00:57 +00:00
bc1fef113d Respect TORCH_DISABLE_ADDR2LINE in symbolizer (#121359)
If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead give the
filename of the shared library as the filename, the offset in that library as the linenumber,
and use dladdr to get the function name if possible. This is much faster than using addr2line,
and the symbols can be later resolved offline using addr2line if desired.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359
Approved by: https://github.com/aaronenyeshi
2024-03-19 20:50:26 +00:00
7718a1cd4f T159183991: Error: EXC_SOFTWARE / SIGABRT at IGPyTorchFramework:-[MPSImageWrapperTrampoline endSynchronization:] (MPSImageWrapper.mm<line_num>):cpp_exception_clas (#122132)
Summary: Prevent crash by not throwing a C++ exception.

Test Plan: spongebobsandcastle

Reviewed By: SS-JIA

Differential Revision: D55036050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122132
Approved by: https://github.com/SS-JIA
2024-03-19 20:01:33 +00:00
c0b2e56c8f Support triton.language.dtype with torch.compile -- Second Attempt (#122141)
This PR is the second attempt at supporting `triton.language.dtype`, now instead of putting it on the graph, we put it on the side table since it is a constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141
Approved by: https://github.com/jansel
ghstack dependencies: #122140
2024-03-19 19:40:52 +00:00
58a805da71 [UserDefinedTriton] Move constant args out of the fx graph (#122140)
@ezyang mentioned that we should not put constant args on the graph. Especially when there are args that would be trickier to put on the graph. E.g. next PR needs `triton.language.dtype` as an argument on the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel
2024-03-19 19:40:52 +00:00
c5ffebebab [export] allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted.

Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-19 19:08:05 +00:00
d56ab7b020 Revert "[torch export][serialize] create a more compact stacktrace format for serialization (#121675)"
This reverts commit eae89138d891d0310483c4d86dcb69b16de0a6b5.

Reverted https://github.com/pytorch/pytorch/pull/121675 on behalf of https://github.com/jeanschmidt due to It seems that this PR broke lint jobs, I am reverting to confirm if this is the case ([comment](https://github.com/pytorch/pytorch/pull/121675#issuecomment-2007919486))
2024-03-19 19:02:09 +00:00
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16cc6715411119bb7db234a9df59065f.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
88999674a0 Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit 39877abee2c3ad1956013d467b0f6e86cd20acfb.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2007898831))
2024-03-19 18:50:12 +00:00
e0d57001ef [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/fully_connected_op_prune.h (#122165)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54380402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165
Approved by: https://github.com/Skylion007
2024-03-19 18:41:16 +00:00
6bd2d12bc7 release gil in prepareProfiler (#121949)
Initializing profiler while holding gil can lead to deadlocks, as it makes some presumably synchronizing cuda calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121949
Approved by: https://github.com/aaronenyeshi
2024-03-19 18:05:21 +00:00
7fb2d69282 [PT2] - Fix cat backwards wrapping on symints (#121527)
Summary:
Wrapping was comparing Symint and ints forcing a guard. Rewrite it with TORCH_GUARD_SIZE_OBLIVIOUS
```
[trainer0|0]:  File "<invalid>", line 0, in THPEngine_run_backward(_object*, _object*, _object*)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::CatBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::details::cat_tensors_backward(at::Tensor const&, std::vector<std::vector<c10::SymInt, std::allocator<c10::SymInt>>, std::allocator<std::vector<c10::SymInt, std::allocator<c10::SymInt>>>> const&, std::vector<c10::ScalarType, std::allocator<c10::ScalarType>> const&, long)
[trainer0|0]:  File "<invalid>", line 0, in c10::operator==(c10::SymInt const&, int)
[trainer0|0]:  File "<invalid>", line 0, in c10::SymBool::guard_bool(char const*, long) const
[trainer0|0]:  File "<invalid>", line 0, in torch::impl::PythonSymNodeImpl::guard_bool(char const*, long)
```

Test Plan: Regular CI

Differential Revision: D54667300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121527
Approved by: https://github.com/ezyang
2024-03-19 18:03:02 +00:00
8de4d86479 Back out "[fx] Preserve Fx graph node order in partitioner across runs (#115621)" (#122113)
Summary:
Original commit changeset: 6578f47abfdb

Original Phabricator Diff: D54913931

Differential Revision: D55027171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122113
Approved by: https://github.com/osalpekar
2024-03-19 18:00:37 +00:00
eae89138d8 [torch export][serialize] create a more compact stacktrace format for serialization (#121675)
Summary:
- we want fx nodes' stack trace format to be backward compatible and same as before in the program we export
- however in the serialized format, we would want to show a more compact stack_trace format, otherwise the nodes attributes are dominated by stack traces
- the diff implements the minimal in serialization process to dedupe node stack traces by resorting to a fileinfo_list and a filename_to_abbrev map, so we can use index to represent filenames, use lineno to represent lines.

Test Plan:
# llm
base on D54497918
```
buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json
```
set up breakpoint after serialization/deserialization
- serialize
```
(Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193647450
(Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder)
(Pdb) json_bytes = json_program.encode('utf-8')
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number
1193604333
(Pdb) sys.getsizeof(json_bytes)
3846
(Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes)
(Pdb) sys.getsizeof(compressed_bytes)
1139
```
in P1193647450 (before serialization), search for `stack_trace`
in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list`

[note: didn't do compression in this diff since the size is pretty small and it adds complexity if we do compression]
- deserialize
```
(Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193629435
```
in P1193629435, search for `stack_trace`

# ads

Differential Revision: D54654443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675
Approved by: https://github.com/angelayi
2024-03-19 17:58:12 +00:00
eqy
271b12c790 [Functorch] Bump tolerances for test_per_sample_grads_embeddingnet_mechanism_functional_call_cuda (#122014)
the `rtol` was indeed a problem on Grace Hopper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122014
Approved by: https://github.com/zou3519
2024-03-19 17:52:39 +00:00
ba9a1d96a4 Add scuba logging for TorchScript usage (#121936)
Summary: Infra to log live usage of TorchScript internally

Test Plan: manually tested

Differential Revision: D54923510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121936
Approved by: https://github.com/zhxchen17
2024-03-19 17:38:27 +00:00
4819da60ab [TD] Add LLM retrieval + heuristic (#121836)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121836
Approved by: https://github.com/osalpekar
2024-03-19 17:31:47 +00:00
cec0fd6f2f [pt2] add symbolic shape support for decompose mm and expose max_block to user config (#121440)
Summary:
1) As described in https://fb.workplace.com/groups/1075192433118967/permalink/1381918665779674/
As a follow up, we can increase max_block["y"] to sovle the issue
2) add symbolic shape support for decompose mm pass. I did not find a good way to compare symint with int. So when there is a symbolic shape, i would assume it is a "large" dim.

Test Plan:
Without change block: aps-pt2-7c23cea900

increase y_block: aps-pt2_dynamic_shape-25a027423c

Differential Revision: D54525453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121440
Approved by: https://github.com/mengluy0125, https://github.com/Yuzhen11
2024-03-19 17:31:16 +00:00
764eae9c4e Revert "Add Flash Attention support on ROCM (#121561)"
This reverts commit a37e22de7059d06b75e4602f0568c3154076718a.

Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm.  We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))
2024-03-19 17:14:28 +00:00
88ebdbc97c [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-19 16:51:43 +00:00
2164b7f746 Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
Lowers compile time by 1s across all suites on average
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121993
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/zou3519
2024-03-19 16:49:28 +00:00
42624bceb6 Fixes nan with large bf16 values (#122135)
Fixes #121558

Performance on main:
``` Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.608132004970683 | 65.90210803551601  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.75877740024589  | 64.83824399765581  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.465420153690506 |  67.6770955324173  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.398148600477725 | 68.19829455344006  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 29.053532000398263 | 99.58901099162175  |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 |  27.826815698063   | 98.05690299253911  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 49.89655229728669  | 178.24282555375248 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 48.840098950313404 | 174.5950729819015  |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 505.66218036692584 | 1865.9265094902366 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 295.0534054543823  | 967.3831606050952  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.496030446141958 | 55.11070846114308  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.47399884648621  | 55.452342028729625 |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.216444296995178 | 55.14447903260589  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 12.763233599252999 | 55.142355500720434 |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 19.409965351223946 |  74.9107634765096  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.02470579952933  | 74.84168506925926  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 46.37695319834165  | 172.19150450546294 |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 45.225963747361675 | 185.19691249821335 |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 634.3090848531574  | 2249.057865119539  |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 320.47313248040155 | 1053.0515247955916 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.448987301671878 | 63.63581650657579  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.509283400140703 | 63.059300999157124 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.71098779467866  | 105.55780201684684 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.264925852417946 | 105.12311349157244 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.218703348655254 | 222.87272597895935 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 43.55393464793451  | 230.63290398567915 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.02968645095825 | 514.6893998607993  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 157.13709802366793 | 624.5892751030624  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1776.7079547047617 | 6353.551096981391  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1143.6000745743513 | 3811.8767354171723 |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.717129248427227 | 55.35991647047922  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.746983398916198 | 55.76716404175386  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.255573300644752 | 106.47456656442955 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.46409669774584  | 108.07770595420152 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 46.63354124641045  | 213.74862996162847 |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 47.01801469782367  | 240.78139301855117 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 127.76448752265424 | 508.08745552785695 |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 168.6308984644711  | 667.2996102133766  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2268.1598202325404 | 7727.2648515645415 |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1242.8469699807465 | 4161.965740495361  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.340955897932872 | 93.72280450770633  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.25262250029482  |  93.2030284893699  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.598425600444898 | 183.23776399483904 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.362583553418514 | 183.51862096460536 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.52303148806094  | 383.50319798337296 |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.41743348259479  | 432.5502900755964  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 217.76640450116247 | 943.9354750793427  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 303.0781910638325  | 1225.4394043702632 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.8542854059488 | 12194.579601055011 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2268.1174043100327 | 7608.0941944383085 |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.289720651460811 | 95.88620596332476  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.618648946750909 | 95.56685149436818  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.567946751601994 | 180.62468653079122 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 28.611703700153157 | 189.4215695792809  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.11306998459621  | 385.25596749968827 |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 93.82540901424363  | 455.77428903197875 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 226.80530551588163 | 965.8026450779289  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 327.4116570246406  | 1312.5067745568228 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.5064804060385 | 15020.768146496266 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2433.0302356975153 | 8300.016750581563  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

Performance on this branch:
```Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.783618393586949 | 65.59692794689909  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.064015300711617 | 56.99719698168337  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.629025398287922 | 68.65267595276237  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.462356004398313 | 68.35797848179936  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 |  29.5476081490051  | 101.22994752600789 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 28.395320149138573 | 98.62275794148445  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 50.50016101449728  | 181.4357690163888  |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 49.450615647947416 | 175.86063902126625 |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 506.06461532879626 | 1866.0613044630736 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 299.9336270149797  | 976.4662646921353  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.45752210286446  | 58.79682704107836  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.407129396684468 | 58.14061599085107  |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.822759891627355 | 56.56979401828722  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 13.39154909946956  |  56.7130644340068  |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 20.282494352431968 | 77.29688903782517  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.899454596452415 |  75.4446149803698  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 48.494275606935844 | 177.5322465109639  |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 46.84524350450374  | 189.1778860008344  |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 635.1026654010639  | 2248.0451600858937 |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 335.1591735263355  | 1080.4320796160027 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.63953539985232  | 65.50709309522063  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.858113402035087 | 63.021871959790595 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.98318645055406  | 105.87883047992364 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.619045056402683 | 104.90188701078296 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.91175540117546  | 226.00732848513871 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 44.39614630537107  | 232.39317198749632 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 135.5409600073472  | 522.7949097752571  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 158.79383607534692 | 628.5856699105352  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1775.9978299727663 | 6343.203847063706  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1160.680354805663  | 3842.235009651631  |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.553713708417488 | 65.50691701704638  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.486379051348194 |  56.9980075233616  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.56585600087419  | 107.89892700267956 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.828144202008843 | 109.05519902007653 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 48.23235589428805  | 217.8974545095116  |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 49.09284680034033  | 244.73925953498107 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.77827049791813 | 522.7259948151186  |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 176.60772847011688 | 681.5171707421541  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2267.821540008299  | 7720.425300067291  |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1295.3941145678982 | 4272.425139788538  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.514714101096615 |  94.2192979855463  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.553097198018804 |  93.244242540095   |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.95821905019693  | 185.0469880155288  |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.709681446664035 | 184.22623950755226 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 85.85420495364815  | 388.3417735341937  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.97473795898259  | 434.4228169647977  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 220.6919804448262  | 958.9654899900779  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 306.55586952343583 | 1233.2170095760375 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.7326447824016 | 12183.611298678443 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2299.064100370742  | 7669.618452200666  |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.427107692928985 | 96.96270158747211  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.856995843118057 | 96.38117247959599  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 |  32.9956392000895  | 182.52741603646427 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 29.397601098753512 | 191.0755339777097  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 89.06024845782667  | 392.2585004474967  |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 97.78487798757851  | 462.07307645818213 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  240.521906001959  | 992.4693452194335  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 341.98952303268015 | 1339.2950996058062 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.311005110853  | 15001.030603889374 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2535.9767401823774 | 8528.990152990447  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

```
{'avg_forward_time_nan_fix': 399.7900972732653,
 'avg_backward_time_nan_fix': 1409.652114014413,
 'avg_forward_time_main_branch': 394.6807206988645,
 'avg_backward_time_main_branch': 1399.4055472857629,
 'geo_mean_nan_fix': 150.95049601244946,
 'geo_mean_main_branch': 148.3381648508822}
 ```

The y axis is wrong and is micro seconds but the relative comparison still works
<img width="790" alt="Screenshot 2024-03-18 at 3 34 15 PM" src="https://github.com/pytorch/pytorch/assets/32754868/ca278c15-b815-4535-bdcd-07e522055466">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122135
Approved by: https://github.com/cpuhrsch
2024-03-19 16:32:00 +00:00
e26280ad8b Fix typing for autograd.Function with ctx-less forward (#122167)
Previously, typing an autograd.Function like the following would lead to
a mypy error (which expects the first arg to forward to be named `ctx`).

This PR fixes that by deleting the ctx arg.

```py
class MySin(torch.autograd.Function):
    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        return x.sin()

    @staticmethod
    def setup_context(*args, **kwargs):
        pass

    @staticmethod
    def backward(ctx, grad):
        if grad.stride(0) > 1:
            return grad.sin()
        return grad.cos()
```

Test Plan:
- tested locally (I don't know how to put up a test in CI for this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167
Approved by: https://github.com/soulitzer
2024-03-19 16:15:23 +00:00
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e2688b8c841c9dd0eeecfba16f027b049.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
c05bf0037d [dynamo] Remove copy_graphstate/restore_graphstate (#122067)
Some dead code cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122067
Approved by: https://github.com/oulgen
2024-03-19 15:37:53 +00:00
7673cb534a Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147)"
This reverts commit 5e2687391229cee6e4dc0214f9208b4ecbe058c1.

Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))
2024-03-19 15:37:24 +00:00
cyy
6c01c25319 [Clang-tidy header][28/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122175)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following https://github.com/pytorch/pytorch/pull/122023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-19 14:08:54 +00:00
6c50308801 [ATen-Vulkan][EZ] Small fixes: fix gpu size calculation and Half scalartype ctype mapping (#122096)
Summary:
## Context

Some small fixes to the ATen-Vulkan backend.

The first is that GPU sizes for a 4 dimensional tensor with width packing had a small bug:

```
      case 4:
        switch (memory_layout) {
          case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED:
            gpu_sizes.at(0) = sizes.at(0);
            gpu_sizes.at(1) = sizes.at(1);
            // should be gpu_sizes.at(2) == sizes.at(2)
            gpu_sizes.at(2) = sizes.at(3);
            gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4));
            break;
```

This was fixed by simplifying the logic of GPU size calculation for texture storage.

The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with a FP16 texture type CPU/GPU transfer shaders will have to read from and write to `float` buffers.

In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly.

Test Plan: CI

Differential Revision: D55018171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096
Approved by: https://github.com/jorgep31415
2024-03-19 13:27:27 +00:00
39877abee2 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-19 13:06:42 +00:00
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
6b5259e507 [lint] bump lint dependency PyYAML to 6.0.1 to support Python 3.12 (#122022)
[PyYAML 6.0.0](https://pypi.org/project/PyYAML/6.0) was released 2.5 years ago and it is not installable with Python 3.12.

This PR bumps the version of [PyYAML to 6.0.1](https://pypi.org/project/PyYAML/6.0.1) in `lintrunner` configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122022
Approved by: https://github.com/Skylion007
2024-03-19 12:23:49 +00:00
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports CUDA device. This PR adds CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
0d845f7b07 Fix auto_functionalize (#121990)
Differential Revision: D54964130

When we re-export, auto_functionalize HOP will be in the graph. Therefore, we need to implement proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed be functional, it is ok to just fall through it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-19 07:11:11 +00:00
a2a88f39ee Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121537
Approved by: https://github.com/ezyang
2024-03-19 06:15:00 +00:00
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and can be shared among each device backend.

# Solution
move `_cuda_trace.py` to `_gpu_trace.py`, which makes each device backend owns their callback, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
09ce76809c Improve compiler detection on MacOS (#121406)
By relying on `is_apple_clang` helper function rather than on compiler name (as `gcc` is clang on MacOS):
```
% which gcc; gcc -v
/usr/bin/gcc
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
But
```
% /opt/homebrew/bin/gcc-13 -v
Using built-in specs.
COLLECT_GCC=/opt/homebrew/bin/gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0)
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406
Approved by: https://github.com/malfet, https://github.com/jansel
2024-03-19 05:32:08 +00:00
FEI
8499767e96 add sdpa choice for DeviceType::PrivateUse1 (#121409)
Fixes  #116854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121409
Approved by: https://github.com/drisspg
2024-03-19 05:08:46 +00:00
5bc7f7f977 [dynamo] Make tx.next_instruction lazy (#122066)
Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 2.5s to 2.4s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122066
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060, #122063
2024-03-19 04:23:30 +00:00
153a01833b [dynamo] Optimize SourcelessBuilder (#122063)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 2.7s to 2.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122063
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060
2024-03-19 04:23:30 +00:00
8082adcf65 [dynamo] Only rename a proxy once (#122060)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 3.9s to 2.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122060
Approved by: https://github.com/oulgen
ghstack dependencies: #122039, #122043, #122055, #122058
2024-03-19 04:23:27 +00:00
2bec55c5f9 [dynamo] Remove VariableTracker.parents_tracker (#122058)
This is leftover from mutable variable tracker days and no longer needed.

Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 4.2s to 3.9s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122058
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055
2024-03-19 04:23:24 +00:00
3c706bf483 [dynamo] Optimize BuiltinVariable (#122055)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.1s to 4.2s (compared to 2 PRs ago).

This works by precomputing (and caching) the parts of `BuiltinVariable.call_function` that don't depend on the values of args/kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122055
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043
2024-03-19 04:23:20 +00:00
07caea5c12 [dynamo] Refactor COMPARE_OP and comparison builtins (#122043)
This removes the duplicate handling of comparison ops between symbolic_convert and bultin and refactors the handling to use the binop infrastructure.  This change regresses overheads a bit, but this is fixed in the next PR.

New test skips are variants of `type(e) is np.ndarray` previously falling back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
2024-03-19 04:23:17 +00:00
769ff86b91 [dynamo] Optimize COMPARE_OP (#122039)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.6 to 5.1s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122039
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-19 04:23:14 +00:00
cyy
e1706bba3b [Clang-tidy header][27/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122023)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following #122015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122023
Approved by: https://github.com/ezyang
2024-03-19 03:26:15 +00:00
5e26873912 Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-19 03:20:33 +00:00
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks where the a LLM compile was failing because it had more than 64 layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
7084528eb9 [dynamo][model_output] Do not include none for CustomizedDictVariable (#122005)
Fixes https://github.com/pytorch/pytorch/issues/120923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122005
Approved by: https://github.com/weifengpy, https://github.com/jansel
ghstack dependencies: #121954
2024-03-19 02:35:48 +00:00
2b06098380 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type defination gap between Windows and Linux.
2. Fix some operator not support on Windows, such as [], /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixed build issue on Windows.
6. Fixed bazel build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-19 02:22:04 +00:00
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
18d94d7165 Make FX nodes sortable (#122071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122071
Approved by: https://github.com/oulgen
2024-03-19 01:40:56 +00:00
1f4d4d3b78 [fx] preserver partiioner order fix (#122111)
Summary:
Previous implementation seems to introduce a key value of {"node": none}. This causes an error in logging later on because we extract the name from the "node" but it is a string instead of a torch.fx.node

This seems to cause tests to pass.

Test Plan:
CI

ExecuTorch CI:
buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models

Reviewed By: larryliu0820

Differential Revision: D55026133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111
Approved by: https://github.com/mikekgfb
2024-03-19 01:00:44 +00:00
34f36a28df [MPS] Fwd-fix for clamp regression (#122148)
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it

- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
  ```
    /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
   ```
- Change the order of max and min call as it's apparently important for
  consistency, as `min(max(a, b), c)` might not equal to `max(min(a, c), b)` if `c` is not always less or equal than `b`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
2024-03-19 00:52:45 +00:00
ae983d2d6e Fix typo in sparse.rst (#121826)
Change word "on" to "one" when talking in the third person.

Fixes #121770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826
Approved by: https://github.com/janeyx99
2024-03-19 00:17:19 +00:00
e6cf3e90a5 [AOTAutograd / Functionalization] Fix incorrect expand_inverse (#122114)
This is a rebase of https://github.com/pytorch/pytorch/pull/114538,
originally submited by @jon-chuang.

Fixes #114302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122114
Approved by: https://github.com/bdhirsh
2024-03-18 22:52:57 +00:00
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
2ab8b34433 Error out in case of in-source builds (#122037)
Such builds could not succeed, as arch-specific ATen dispatch mechanism will create temporary files that will be added to the build system with every rebuild, which will result in build failures

Fixes https://github.com/pytorch/pytorch/issues/121507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-03-18 21:48:18 +00:00
e6a461119a [functorch] Add batch rule for linalg.lu_unpack (#121811)
Fixes: https://github.com/pytorch/pytorch/issues/119998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121811
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-03-18 21:24:16 +00:00
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
a17cd226d6 [inductor] Enable FX graph caching on another round of inductor tests (#121994)
Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994
Approved by: https://github.com/eellison
2024-03-18 20:55:18 +00:00
7c5e29ae71 Back out "Support triton.language.dtype with torch.compile (#121690)" (#122108)
Summary: Some hard to deal with package import/export related problems. Lets revert and start with clean slate.

Test Plan: CI

Differential Revision: D55024877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108
Approved by: https://github.com/ezyang
2024-03-18 20:50:28 +00:00
685ace3834 [compiled autograd] add dynamo segfault test (#122004)
To catch issues like https://github.com/pytorch/pytorch/issues/121862 in CI. This passes because we reverted the PRs, and https://github.com/pytorch/pytorch/pull/121870 confirms that this test can catch it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122004
Approved by: https://github.com/eellison
2024-03-18 20:07:15 +00:00
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
0a1b3be216 chore: add unit test to verify split_by_tags output_type (#121262)
Add a test case as per https://github.com/pytorch/pytorch/pull/120361#issuecomment-1979163324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121262
Approved by: https://github.com/atalman
2024-03-18 19:19:26 +00:00
676a77177e Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)"
This reverts commit 4cbf963894e78d1cfedffe4f829740dc99163caa.

Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))
2024-03-18 19:03:11 +00:00
df1cdaedeb Log restart reasons and extra compile time in CompilationMetrics (#121827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-03-18 18:59:25 +00:00
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach finalizer, because we don't have
a weak key AND value dict (only one or the other).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
32410f80ec [Caffe2 CPU tests] Update CMakeLists.txt (#119643)
I was trying to build PyTorch with USE_GLOG=ON (so we could get better timestamps around the nccl logging) and ran into this error

```
[1/7] Linking CXX executable bin/verify_api_visibility
FAILED: bin/verify_api_visibility
: && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG -rdynamic     -Wl,--no-as-needed caffe2/CMakeFiles/verify_api_visibility.dir/__/aten/src/ATen/test/verify_api_visibility.cpp.o -o bin/verify_api_visibility -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/usr/local/cuda/lib64:/root/conda/lib:/mnt/code/pytorch/build/lib:  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /root/conda/lib/libmkl_intel_lp64.so  /root/conda/lib/libmkl_gnu_thread.so  /root/conda/lib/libmkl_core.so  -fopenmp  /usr/lib64/libpthread.so  -lm  /usr/lib64/libdl.so  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /root/conda/lib/libglog.so.0.4.0  /root/conda/lib/libgflags.so.2.2.2  -lpthread  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  lib/libgtest.a  -pthread && /root/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/verify_api_visibility && :
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /mnt/code/pytorch/build/lib/libtorch.so: undefined reference to symbol '_ZTVN10__cxxabiv117__class_type_infoE@@CXXABI_1.3'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/lib64/libstdc++.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```

Adding stdc++ explicitly to the list of libraries to link seems to fix the build, and I was able to get a working build of PyTorch.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119643
Approved by: https://github.com/zdevito
2024-03-18 18:35:32 +00:00
5d52b163d1 [dynamo] Optimize load/store/const op handling (#122038)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 6.7s to 5.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122038
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034, #122035
2024-03-18 18:08:06 +00:00
4034873a31 [dynamo] Optimize builtin handling (#122035)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 7.3s to 6.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122035
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034
2024-03-18 18:08:06 +00:00
6ca0323615 [dynamo] Optimize VariableTracker.__post_init__ (#122034)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 8.6s to 7.3s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122034
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033
2024-03-18 18:08:06 +00:00
115c9c6d6b Remove __getattribute__ on autograd.Function (#122033)
Improves `benchmarks/dynamo/microbenchmarks/overheads.py` from 38.7us to
34.3us.

See #122029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122033
Approved by: https://github.com/zou3519, https://github.com/soulitzer
ghstack dependencies: #122032
2024-03-18 18:08:06 +00:00
5a10b56083 [dynamo] Small microbenchmark changes (#122032)
Used to generate numbers in #122029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
1a58e9d357 [TD] LLM indexer to run daily (#121835)
Run indexer daily
Run indexer in docker container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121835
Approved by: https://github.com/osalpekar, https://github.com/malfet
2024-03-18 16:34:01 +00:00
ceb1910bad Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 11b36e163df66196d24fbded4b37ef8f8c032640.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to New action is breaking current ci in not rebased PRs ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2004393980))
2024-03-18 16:33:23 +00:00
11b36e163d [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old style `runs-on` definition accept a string, new style `runs-on` requires a object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH actions yaml are not templatable and support minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-18 15:40:43 +00:00
c4d24b5b7f special-case cuda array interface of zero size (#121458)
Fixes #98133
retry of #98134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121458
Approved by: https://github.com/bdice, https://github.com/ptrblck, https://github.com/mikaylagawarecki
2024-03-18 15:21:38 +00:00
f7908d9fa8 enable reshape+linear+reshape fusion for dynamic shapes (#121116)
reshape+linear+reshape fusion for dynamic shapes has been disabled in https://github.com/pytorch/pytorch/pull/107123.
Re-enable it by comparing the symbolic values in case of dynamic shapes. This will improve the performance for dynamic shape cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121116
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 14:46:27 +00:00
f2f8eeea94 Inductor: fix Conv output stride for dynamic shapes (#121400)
Fixes https://github.com/pytorch/pytorch/issues/120873.
Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`.

### Static shape
In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last.
96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)

### Dynamic shape
In dynamic shape cases, we won't do weight prepack for Conv, in this case, the Conv kernel decides the output layout based on the input and weight layout.
96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)

For input with `channels = 1`, like tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, in Inductor, with `req_stride_order` in channels last order, the `require_stride_order` on `x` of such size and stride won't change the stride of the tensor since stride for dimensions of size 1 is ignored
96ed37ac13/torch/_inductor/ir.py (L5451)

While in Conv kernel, such tensor is consider it as **contiguous** tensor instead of channels last tensor thus the output of the Conv kernel will be in contiguous format.
96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)

To align the behavior of the Conv kernel, we set the output_stride in such case to be contiguous instead of channels last.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 10:56:58 +00:00
206da97b8b [aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)
looks like we already support aoti_torch_cuda_sort in C shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122052
Approved by: https://github.com/oulgen
2024-03-18 09:14:35 +00:00
65ccac6f17 Fix triton import time cycles (#122059)
Summary: `has_triton` causes some import time cycles. Lets use `has_triton_package` which is enough.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)'
```
now passes

Differential Revision: D55001430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059
Approved by: https://github.com/aakhundov
2024-03-18 05:50:32 +00:00
bc9d054260 [executorch hash update] update the pinned executorch hash (#122061)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122061
Approved by: https://github.com/pytorchbot
2024-03-18 05:02:27 +00:00
7380585d97 [vision hash update] update the pinned vision hash (#122062)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122062
Approved by: https://github.com/pytorchbot
2024-03-18 03:41:50 +00:00
e39aedfcc5 Fix fx graph triton import bug (#122041)
Summary: Unless we register triton to be a special import, FX graph import mechanism imports it as `from fx-generated._0 import triton as triton` which is obviously broken.

Test Plan:
I could not figure out how to write a test for this but
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares
```
now passes

Differential Revision: D54990782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041
Approved by: https://github.com/aakhundov
2024-03-17 22:48:51 +00:00
5030913d6a [test] Delete variables that have been declared but not referenced di… (#121964)
Delete variables that have been declared but not referenced in aten/src/ATen/test/cuda_distributions_test.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121964
Approved by: https://github.com/janeyx99
2024-03-17 09:45:05 +00:00
cyy
d9460758df [Clang-tidy header][26/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122015)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122015
Approved by: https://github.com/ezyang
2024-03-17 07:56:45 +00:00
c568b84794 [dynamo][guards] Move backend match to eval_frame (#121954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121954
Approved by: https://github.com/jansel
2024-03-17 06:52:10 +00:00
fc504d719f [executorch hash update] update the pinned executorch hash (#122036)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122036
Approved by: https://github.com/pytorchbot
2024-03-17 04:56:37 +00:00
6f74b76072 Move get_unwrapped outside of disable_functorch (#121849)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849
Approved by: https://github.com/albanD
2024-03-16 22:25:07 +00:00
3bd38928ba [export] Improve consistency for nn_module_stack metadata, add checks to _trace.py (#120661)
We would like to improve consistency for nn_module_stack metadata in torch.export.

This PR ensures that all tests in test/export/test_export.py has the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
2024-03-16 21:44:52 +00:00
6d9588a12b [inductor] disable linear weight prepacking pass on double (#121478)
Fix #121175

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121478
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-03-16 13:24:21 +00:00
9990d1bc22 Add 'profiler/python' to the package.' (#121892)
Fixes #ISSUE_NUMBER
expose the `py_symbolize` interface for use.
thank you
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121892
Approved by: https://github.com/zdevito
2024-03-16 11:11:26 +00:00
5f601a41e0 Pin protobuf to 3.20.2 on macOS (#121918)
The newer protobuf 5.26.0 releasing on March 13rd is causing failures with `test_hparams_*` from `test_tensorboard` in which the stringify metadata is wrong when escaping double quote. For example, 3bc2bb6781.  This looks like an upstream issue from Tensorboard where it doesn't work with this brand new protobuf version https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29

The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too.  We want to eventually just have one requirements.txt file.

Fixes https://github.com/pytorch/pytorch/issues/122008
Fixes https://github.com/pytorch/pytorch/issues/121927
Fixes https://github.com/pytorch/pytorch/issues/121946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918
Approved by: https://github.com/kit1980
2024-03-16 09:48:05 +00:00
4d9d5fe540 [executorch hash update] update the pinned executorch hash (#122009)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122009
Approved by: https://github.com/pytorchbot
2024-03-16 04:46:45 +00:00
4d92928fe2 [dynamo] Add tests for fake FSDP (#121610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121610
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735, #120965
2024-03-16 04:29:59 +00:00
0b7d9711d4 [dynamo] Add support for nn.Parameter constructor (part 2) (#120965)
This handles the case where the tensor isn't an input.

The changes to dynamo tests are cases where we would previously fall back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120965
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735
2024-03-16 04:29:58 +00:00
040b925753 [Compiled Autograd] Reorder accumulate grad nodes (#121735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121735
Approved by: https://github.com/xmfan
2024-03-16 04:29:56 +00:00
f0b9a8344a [vision hash update] update the pinned vision hash (#121177)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121177
Approved by: https://github.com/pytorchbot
2024-03-16 03:25:08 +00:00
b94691700e [FSDP] Avoided CPU sync in clip_grad_norm_ (#122001)
Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it.

`FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there is guaranteed to be some sharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol
2024-03-16 03:01:49 +00:00
7bc91d5dc2 [mergebot][BE] If we don't have any required checks, don't run required checks (#121921)
This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation.

In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921
Approved by: https://github.com/huydhn
ghstack dependencies: #121920
2024-03-16 01:35:21 +00:00
2b71b21a3f Don't use Proxy torch function in the sym size calls (#121981)
Fixes #ISSUE_NUMBER

Changes from https://github.com/pytorch/pytorch/pull/121938 + adds test

@bypass-github-pytorch-ci-checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121981
Approved by: https://github.com/davidberard98
2024-03-16 01:20:26 +00:00
37e563276b Document complex optimizer semantic behavior (#121667)
<img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667
Approved by: https://github.com/albanD
2024-03-16 00:43:47 +00:00
12662900f9 [inductor] FX graph cache: Fix bug handling constants (#121925)
Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925
Approved by: https://github.com/eellison
2024-03-16 00:11:51 +00:00
cyy
6b0f61891f [Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952)
This PR enables clang-tidy to code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952
Approved by: https://github.com/Skylion007
2024-03-16 00:09:54 +00:00
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac71c1bfc9b13c3281a713fed89f15e0f.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
07ec3356b9 Revert "Force upsample to be float32 (#121324)"
This reverts commit 2770e3addd9f05101705f0fef85a163e0034b8a5.

Reverted https://github.com/pytorch/pytorch/pull/121324 on behalf of https://github.com/huydhn due to I think it is better to revert and reland this next week 2770e3addd ([comment](https://github.com/pytorch/pytorch/pull/121324#issuecomment-2000617536))
2024-03-15 23:20:01 +00:00
256c0ec1e5 [docs] Added comment on replicate -> partial for _NormPartial (#121976)
Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869, #121945
2024-03-15 23:04:06 +00:00
b717aa6f36 Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 2c33e3a372c077badc561b4aad4997e52c03610a.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to I am seeing lots of inductor jobs failing after this change 2c33e3a372.  They looks unrelated though but this change updates Docker image so may be something sneaks in.  I will try to revert this to see if it helps and will reland the change after ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2000547641))
2024-03-15 22:05:21 +00:00
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
26aaabb979 [c10d] initialize lastEnqueuedSeq_ and lastCompletedSeq_ (#121980)
Summary:
It is found that this 2 unitilized number was logged with some super
large or negative numbers, which is confusing. So we need to initialize
them. Now -1 indicate the number if invalid, or no work is completed or
enqueued yet. 0 could be a legit seq id.
Test Plan:
Build

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121980
Approved by: https://github.com/xw285cornell, https://github.com/wconstab, https://github.com/kwen2501, https://github.com/XilunWu
2024-03-15 21:45:15 +00:00
dfc5e9325d format caffe2/torch/_export/serde/serialize.py (#121670)
Summary: black caffe2/torch/_export/serde/serialize.py

Test Plan: tests

Differential Revision: D54654847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121670
Approved by: https://github.com/angelayi
2024-03-15 21:30:16 +00:00
53d2188df9 Update get_aten_graph_module (#121937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937
Approved by: https://github.com/andrewor14
2024-03-15 20:35:55 +00:00
af86d67d61 [Doc][NVTX] Add documentation for nvtx.range (#121699)
The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699
Approved by: https://github.com/janeyx99
2024-03-15 20:26:44 +00:00
b92daff6e9 [DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942)
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.

Note that we need to investigate why when using ASGD we need higher atol and rtol when comparing model parameters. Listing it as a TODO now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
2024-03-15 20:21:27 +00:00
f4dd2fda51 [DTensor] Supported 2D clip_grad_norm_ (#121945)
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.

Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
2024-03-15 20:11:24 +00:00
2c33e3a372 [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old style `runs-on` definition accept a string, new style `runs-on` requires a object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group the format of the yaml needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of. This is due to the fact that GH actions yaml are not templatable and support minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-15 20:09:50 +00:00
6f4fa8e9a1 [inductor] FX graph cache: simplify "current callable" logic (#121903)
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
2024-03-15 20:00:08 +00:00
d0d09f5977 Fix torch.compile links (#121824)
Fixes https://github.com/pytorch/pytorch.github.io/issues/1567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824
Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet
ghstack dependencies: #121823
2024-03-15 19:49:37 +00:00
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/main:docs/master:g" {} + `

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
535bc71d03 Enable FX graph caching in another batch of inductor tests (#121697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697
Approved by: https://github.com/eellison
2024-03-15 19:38:51 +00:00
3ee319c49c Fall back to eager mode when viewing with differing bitwidths (#120998) (#121786)
The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change looks for a source and destination dtype and, if different sizes, falls back to the eager mode aten implementation.  Prior to this change, this condition would throw an exception.

Fixes #120998.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786
Approved by: https://github.com/peterbell10, https://github.com/bertmaher
2024-03-15 19:33:30 +00:00
409b1a6081 Add lowering for cummax, cummin (#120429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429
Approved by: https://github.com/peterbell10
2024-03-15 19:04:38 +00:00
d04faf4531 [dynamo][compile-time] Remove preserve rng state per op (#121923)
We already have one globally - 02bb2180f4/torch/_dynamo/convert_frame.py (L477)

I don't think we need per op.

Saves ~2 seconds on this benchmark

~~~
def fn(x):
    for _ in range(10000):
        x = torch.ops.aten.sin(x)
    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121923
Approved by: https://github.com/jansel
2024-03-15 18:24:46 +00:00
67ec870234 Fix FakeTensorUpdater logic for updating fake tensors (#116168)
Fixes https://github.com/pytorch/pytorch/issues/114464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116168
Approved by: https://github.com/peterbell10
2024-03-15 18:22:24 +00:00
239d87af5e combine loops so fn_name correct in error message (#121601)
The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name` as it set only in the previous for loop.  This change merges the two loops so that `fn_name` has the correct value.

No Issue Number for this minor change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601
Approved by: https://github.com/albanD
2024-03-15 17:14:56 +00:00
39fdde7f84 [release] Increase version 2.3.0->2.4.0 (#121974)
Branch cut for 2.3.0 completed hence advance main version to 2.4.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121974
Approved by: https://github.com/jeanschmidt
2024-03-15 17:09:33 +00:00
565d1e28ab update kineto submodule commit id (#121843)
Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto changes from https://github.com/pytorch/kineto/pull/880

Test Plan: CI

Differential Revision: D54828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121843
Approved by: https://github.com/aaronenyeshi
2024-03-15 16:55:25 +00:00
3c3d7455a3 Disable inductor (default) and inductor (dynamic) by default in the perf run launcher (#121914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121914
Approved by: https://github.com/desertfire
2024-03-15 16:46:24 +00:00
ef25d83a62 [export] Add serialization support for tokens (#121552)
Differential Revision: [D54906766](https://our.internmc.facebook.com/intern/diff/D54906766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121552
Approved by: https://github.com/zhxchen17
2024-03-15 16:15:11 +00:00
014f91a9d9 [FSDP2] implement HSDP (#121569)
support HSDP in per-parameter sharding FSDP: https://github.com/pytorch/pytorch/issues/121023

HSDP is a hybrid of FSDP and DDP: reduce-scatter grads intra-node (FSDP), and all-reduce grads inter-node (DDP)

for unit test, we are testing 2 + 2 GPUs in single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp``

allreduce overlaps with next reduce-scatter in profiler traces
<img width="886" alt="Screenshot 2024-03-14 at 3 02 52 PM" src="https://github.com/pytorch/pytorch/assets/134637289/98f1f2b5-c99d-4744-9938-10d0431487e5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569
Approved by: https://github.com/awgu
2024-03-15 10:00:18 +00:00
4cbf963894 [BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)
Switch to use LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and move to use LF resources.

Depends on https://github.com/pytorch/pytorch/pull/121907
Follow up issue https://github.com/pytorch/pytorch/issues/121919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908
Approved by: https://github.com/malfet
2024-03-15 09:09:53 +00:00
2770e3addd Force upsample to be float32 (#121324)
Fixes #121072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324
Approved by: https://github.com/ezyang
2024-03-15 07:50:45 +00:00
e25054b248 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved compilation code into _compiled_autograd_impl, frees stack allocated objects e.g. AutogradCompilerCall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
2024-03-15 07:12:38 +00:00
5a2b4fc8f0 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
2024-03-15 06:51:27 +00:00
fc33bbf827 better support set_default_dtype(torch.float16), update doc (#121730)
1. Fixes #121300
2. Previously, calling `torch.tensor([2j])` after `torch.set_default_dtype(torch.float16)` will cause a runtime error. This PR also fixes it and enables test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121730
Approved by: https://github.com/peterbell10
2024-03-15 06:48:42 +00:00
8fdd8125b6 [executorch hash update] update the pinned executorch hash (#121871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121871
Approved by: https://github.com/pytorchbot
2024-03-15 05:25:36 +00:00
cyy
fb10e13000 [Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781)
This PR begins to clean clang-tidy warnings of code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781
Approved by: https://github.com/ezyang
2024-03-15 05:03:22 +00:00
e4fda049c2 DTensor: add comm tests to test_tp_examples (#121669)
This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`.

Fixes #121649

Test plan:

```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669
Approved by: https://github.com/wanchaol
2024-03-15 03:37:48 +00:00
02083f5452 [DCP][DSD] Add AdamW to distributed state dict unit tests (#121774)
Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. https://github.com/pytorch/pytorch/pull/121544

This PR adds "optimzer_class" as a kwarg for the subtests of the following tests to add AdamW as an option.

- test_fsdp
- test_compiled_fsdp
- test_fsdp2
- test_ddp
- test_fsdp_ddp
- test_cpu_offload_full_state_dict

In addition, we temporarily remove the two _verify_osd_by_load in _test_save_load, as state dict loading seems affect parameters. Creating an issue https://github.com/pytorch/pytorch/issues/121186 to keep track.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121774
Approved by: https://github.com/Skylion007
ghstack dependencies: #121773
2024-03-15 03:33:33 +00:00
efbeefbb84 [executorch] Make trymerge force merges actually work with executorch (#121920)
This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks.

Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock.

## Testing:
We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work

https://pastebin.com/HHSFp2Gw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920
Approved by: https://github.com/huydhn
2024-03-15 03:21:44 +00:00
a623666066 [dynamo][compile-time] Make output_graph new_var linear (#121858)
Fixes https://github.com/pytorch/pytorch/issues/121679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121858
Approved by: https://github.com/jansel
2024-03-15 03:20:04 +00:00
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce to arr in loops. Instead, reduce to a local scaler and write it to arr after local reduction is done. This will allow the compiler to optimize the reduction variable in register instead read/write from memory. If the `working set` of `loop body` is quite large, `read/write from register/memory` will have a large gap.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // access array will always from memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Following aten to use `two pass reduction` with `omp parallel` for deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I test this PR with dynamo benchmark on 32-core ICX system,
Result (avg speed up):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| hugginface | 1.346  | 1.343 |
| timms | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00
0cd094a4fd Revert "[aoti] Fix compilation bug for buffer mutations (#121688)"
This reverts commit 9f314d4aa82169ee552ae2a8ad701bd0441a12b7.

Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))
2024-03-15 01:34:04 +00:00
01d7c948e2 Make torch/_inductor/comms.py recognize native funcol IRs as collective IRs (#118498)
### Summary

As title. After this PR, Inductor should recognize native funcol IRs as collectives wherever the existing funcol IRs are recognized as collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118498
Approved by: https://github.com/wanchaol
2024-03-15 01:24:36 +00:00
60ccf81490 [dynamo] Refactor update_block_stack into a seperate function (#121810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121810
Approved by: https://github.com/williamwen42
ghstack dependencies: #121790
2024-03-15 01:01:05 +00:00
1e9a7df8fe [dynamo] Compile time optimizations in tx.step() (#121790)
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`

`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
2024-03-15 01:01:05 +00:00
1afa8e0985 Fix #83153: torch.nn.hardtahn allowed min_val to be greater than max_val (#121627)
Fixes #83153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121627
Approved by: https://github.com/albanD
2024-03-15 00:57:45 +00:00
710446b1eb [dtensor] refactor and generalize stack strategy (#121869)
This PR rewrite the stack strategy to be more generalized, basically
stack/cat like strategy follow pattern need to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS

So this PR refactors how the follow strategy should work, and make sure
we start following the strategy that incurred lowest cost. i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
2024-03-15 00:34:25 +00:00
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb87b364375a468c296d2daef0e8817d7.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf02309583dfe3b5c3421442fda5ec7e.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
d604ab81a2 [PyTorch] Fix static runtime sigrid_hash precomputed multiplier pass (#120851)
This pass was broken.

Differential Revision: [D54336561](https://our.internmc.facebook.com/intern/diff/D54336561/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54336561/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120851
Approved by: https://github.com/houseroad
2024-03-15 00:02:38 +00:00
cceabe873f [jit] ClassType hashing: hash on compilation_unit as well (#121928)
Following up on #121874 - it turns out that in our case, we're seeing repeated class names that are from different compilation units.  Our previous hash function wasn't considering the compilation unit, leading to hash collisions (and then exponential memory usage in the number of copies of this class name)

Differential Revision: [D54916455](https://our.internmc.facebook.com/intern/diff/D54916455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121928
Approved by: https://github.com/eellison
ghstack dependencies: #121874
2024-03-14 23:16:08 +00:00
2d9cee20a2 [jit] AliasDB type hash - don't always return 0 (#121874)
This hash was missing an assignment, so for almost all types it was returning "0".

c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage.

Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874
Approved by: https://github.com/eellison
2024-03-14 23:16:08 +00:00
57b20c51b9 Don't record autograd state ops while torch.compile in pre-dispatch export (#121736)
Summary: Refer to OSS PR for details

Test Plan: CI

Differential Revision: D54812833

In pre-dispatch export, we have a special proxy torch mode where we intercept torch._C._set_grad_enabled op to correctly capture user's intention on train/eval. However, this is bit problematic when we are tracing torch.cond during export as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager  calls that are happening inside dynamo framework code because the top level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into HOP for us, so we don't have to intercept with special proxy mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-03-14 23:06:10 +00:00
bd7beef529 [Inductor] Update the cpp_wrapper entry function signature (#121745)
Summary: Update the entry function to use AtenTensorHandle instead of at::Tensor. This makes the compilation of the generated cpp wrapper code much faster: test_cpu_cpp_wrapper.py from 35 min to 21 min, and test_cuda_cpp_wrapper.py from 21 min to 14 min.

Differential Revision: [D54818715](https://our.internmc.facebook.com/intern/diff/D54818715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121745
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #121523, #121743, #121744
2024-03-14 22:23:00 +00:00
8be80706b4 [AOTI] Add pybind for tensor_converter util functions (#121744)
Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523, #121743
2024-03-14 22:20:51 +00:00
46493ee9b5 [AOTI][refactor] Update tensor_converter util functions (#121743)
Summary: Update the signature of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wraper JIT Inductor.

Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523
2024-03-14 22:17:54 +00:00
3df1b3b0ad [jit] support getattr/hasattr on NamedTuple (#121863)
getattr is already supported on objects, and seems like for the most part for NamedTuples. The only remaining gap seems to be that hasattr only accepted objects, not NamedTuples. This PR adds support, and adds some basic tests.

Differential Revision: [D54888612](https://our.internmc.facebook.com/intern/diff/D54888612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121863
Approved by: https://github.com/eellison
2024-03-14 22:07:28 +00:00
818b14025a [AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu.

Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523
Approved by: https://github.com/chenyang78
2024-03-14 22:05:38 +00:00
43e243180b Add gpt-fast as a static benchmark (#121886)
Run:
```
python benchmarks/gpt_fast/benchmark.py
```
It generated a cvs file ```gpt_fast_benchmark.csv``` with the content like:
```
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886
Approved by: https://github.com/Chillee
2024-03-14 21:46:59 +00:00
0e68eb1505 Add privateuseone flags for c10::EventFlag (#121118)
Fixes #117341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118
Approved by: https://github.com/albanD
2024-03-14 20:07:12 +00:00
9f314d4aa8 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left is as TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
2024-03-14 19:35:26 +00:00
0636c11811 [AOTInductor] Include build cmds at the end of wrapper file (#121872)
Summary:
For easier debugging, include build commands at the end of codegen wrapper.

{F1468438991}

Test Plan: CI

Differential Revision: D54882164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121872
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-03-14 18:41:17 +00:00
c409292197 [sigmoid] Use deserializer from oss. (#121839)
Summary:
Old path:
thrift -> thrift deserializer -> graph module.
new path:
thrift -> python dataclass -> oss deserializer -> graph_module

Test Plan:
CI
buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Reviewed By: SherlockNoMad

Differential Revision: D54855251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121839
Approved by: https://github.com/angelayi
2024-03-14 18:38:58 +00:00
499136a4dd [Inductor] Fix a dynamic shape problem when lowering diagonal (#121881)
Summary: when computing the diagonal size, we need to use correct symbolic min/max function.

Differential Revision: [D54884899](https://our.internmc.facebook.com/intern/diff/D54884899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121881
Approved by: https://github.com/aakhundov
2024-03-14 18:36:37 +00:00
5b1642516f [with_effects] Skip over profiler.record_function_exit (#121829)
Summary:
tldr: User calls to `torch.autograd.profiler.record_function` fails when tracing with non-strict pre-dispatch export due to an effect token failure, so the solution is to skip over these operators 😅

Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler.

Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but **non-strict pre-dispatch export** will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d).

However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619)

The reason is because this operator only gets hit during pre-dispatch, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine what operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do create any tokens. During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exist, causing an error :(

This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator.

Test Plan: Fixed test for P1195258619

Differential Revision: D54857444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2024-03-14 18:09:43 +00:00
f1f7c5c31e [ez] Document for add_var_to_val (#121850)
Summary: Add doc for ShapeEnv.add_var_to_val

Test Plan: doc only change

Reviewed By: izaitsevfb

Differential Revision: D54872335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121850
Approved by: https://github.com/izaitsevfb
2024-03-14 18:01:09 +00:00
4c3a052acf [BE] Add S3 bucket argument to number of workflows (#121907)
Namely, it adds the `s3-bucket` argument to the following workflows, with default value set to `gha-artifacts`):
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts

This is prerequisite part is required in order to start migrating to other s3 buckets for asset storage; This is one of the required steps in order to migrate to ARC and move our assets away from our S3 to Linux Foundation S3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
2024-03-14 17:57:05 +00:00
38d7d366b9 [FSDP2] Added 2D DCP save/load test (#121747)
To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747
Approved by: https://github.com/wz337
2024-03-14 17:24:17 +00:00
443444dc7f [c10d] Add generic scuba logging capability into c10d (#121859)
Summary:
This diff tries to periodically (e.g., every 30s) log critical collective
progress status to scuba table, starting from a few metric such as last
enequeued seq id.

With the Scuba table, it is our hope that we can easily detect the straggler of a PG,
E.g., the rank that has not progressed it seq_ for X seconds while other ranks in the same PG have a larger seq_

The implementation needs to make sure that Scuba will be used only for FB internal use
cases.

For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.

Test Plan:
Re-use the existing unit test for fb side of operations, such as
test_register_and_dump in test_c10d_manifold and change the dump period to a
very small number, e.g., 1ms, verified that the loggs are correctly shown in scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy

Reviewed By: wconstab

Differential Revision: D54556219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
2024-03-14 16:03:45 +00:00
83f8e51404 Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning (#119986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119986
Approved by: https://github.com/kadeng
ghstack dependencies: #119685
2024-03-14 16:03:10 +00:00
be0bdf111c relax tol for flaky nansum_out_dtype_cuda_float32 test (#121550)
TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs. Observed by ROCm internal QA testing.  But same problematic random inputs breaks the test for CUDA, verified on V100.

There is precedent in another test within the same file to relax tolerance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550
Approved by: https://github.com/albanD
2024-03-14 15:28:45 +00:00
7e13b5ba29 Checkout release branch rather then commit_hash when building triton release (#115379) (#121901)
Cherry pick of https://github.com/pytorch/pytorch/pull/115379 from Release 2.2 that should be applied to main and Release 2.3 as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121901
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt
2024-03-14 14:42:29 +00:00
956059fa2e [Fix] Fixed behaviour for the conversion of complex tensors to bool (#121803)
Fixes #120875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121803
Approved by: https://github.com/lezcano
2024-03-14 13:35:15 +00:00
1251f0fa31 Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch, https://github.com/kadeng
2024-03-14 13:25:23 +00:00
38d9bb5abc Make PyTorch compilable against upcoming Numpy-2.0 (#121880)
Test plan:
```
% python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))"
2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9])
% python -c "import torch;print(torch.rand(3, 3).numpy())"
[[0.0931946  0.44874293 0.8480404 ]
 [0.93877375 0.10188377 0.67375803]
 [0.02520031 0.89019287 0.5691561 ]]

```
Fixes https://github.com/pytorch/pytorch/issues/121798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880
Approved by: https://github.com/albanD
2024-03-14 05:36:50 +00:00
b4c53aa0ec Do not compile FP16 arith internally (#121844)
Also, decorate unused args with `C10_UNUSED` to fix linter warnings
Test Plan: `buck2 build -c fbcode.arch=aarch64  //caffe2:ATen-cpu`

Differential Revision: D54870507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121844
Approved by: https://github.com/osalpekar
2024-03-14 05:19:02 +00:00
3eb322ff29 Handle transitive replacements in Triton kernel mutation analysis (#121867)
Summary: Previously, we didn't handle transitive replacements in MLIR walk-based function info mining in the Triton kernel mutation analysis pass. As a result, for the TTIR below:

```
tt.func private @cumsum__fp32S1_16S__1cconstexpr_1__2cconstexpr_False_(%arg0: tensor<1x16xf32> loc("...":296:0)) -> tensor<1x16xf32> attributes {noinline = false} {
    %0 = "tt.scan"(%arg0) <{axis = 1 : i32, reverse = false}> ({
    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
      %1 = tt.call @_sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc16)
      tt.scan.return %1 : f32 loc(#loc16)
    }) : (tensor<1x16xf32>) -> tensor<1x16xf32> loc(#loc16)
    tt.return %0 : tensor<1x16xf32> loc(#loc18)
  } loc(#loc15)
```

the mined function dict looked like this:

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Intermediate(idx=26),
                                 Intermediate(idx=26)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

whereas it should look like this (not the `Param(idx=0)` arguments of the `tt.call`):

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Param(idx=0),
                                 Param(idx=0)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

This is fixed in the PR.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_cumsum
.
----------------------------------------------------------------------
Ran 1 test in 1.771s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121867
Approved by: https://github.com/oulgen
2024-03-14 04:06:37 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
15abc56bd5 Graph break on step closure in optimizer (#121777)
Fixes https://github.com/pytorch/pytorch/issues/116494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121777
Approved by: https://github.com/yanboliang
2024-03-14 03:18:23 +00:00
f85f58bf86 Fix quantized linear vulkan tests (#120960)
Summary: Fixed quatized linear vulkan tests by using an old pack_biases function.

Test Plan:
**Vulkan quantized api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

...
...
...
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (5 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (4 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
...
...
[----------] 85 tests from VulkanAPITest (1704 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1704 ms total)
[  PASSED  ] 85 tests.

  YOU HAVE 8 DISABLED TESTS

**Vulkan api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[----------] Global test environment tear-down
[==========] 426 tests from 1 test suite ran. (4997 ms total)
[  PASSED  ] 423 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.log_softmax_underflow
[  FAILED  ] VulkanAPITest.log_softmax

Differential Revision: D54396367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120960
Approved by: https://github.com/yipjustin
2024-03-14 02:23:00 +00:00
a37caa6ed3 [Quant][Inductor] Enable quantization linear pattern fusion with int8_mixed_bf16 for gelu (#116004)
**Summary**
Enable QLinear Unary pattern for gelu with int8_mix_bf16

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_int8_mixed_bf16

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #114853, #114854
2024-03-14 01:52:12 +00:00
43d68e9c8f [Quant][Inductor] Enable quantization linear pattern fusion for gelu inside inductor (#114854)
**Summary**
Enable QLinear Unary pattern for gelu with int8

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_cpu

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114854
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #114853
2024-03-14 01:49:14 +00:00
25e00545bb [Quant][PT2E] Enable linear and linear-unary post-op gelu quant recipe for x86 inductor quantizer (#114853)
**Summary**
Add Gelu for linear-unary post-op quantization recipe to x86 inductor quantizer.

**Test plan**
python -m pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_unary_gelu
python test/test_quantization.py -k test_linear_unary_with_quantizer_api
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114853
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-03-14 01:46:35 +00:00
a04e7fca8e Use memcache versioning for autotune remote cache (#121748)
Summary: Internal training platform doesn't get updated very frequently, so lets use versioning for memcache.

Test Plan: existing tests

Differential Revision: D54818197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121748
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-03-14 00:36:10 +00:00
7e076c75bd [C10D] Fix coalescedCollective op Flight Recording (#120430)
Also noticed and filed https://github.com/pytorch/pytorch/issues/120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120430
Approved by: https://github.com/kwen2501
2024-03-13 23:55:00 +00:00
bf7ac4ddf7 Revert "[export] allow Dim(1,2) for export dynamic shapes (#121642)"
This reverts commit a8dcbf2749f2081f939621db2d38fd15ab7e34a3.

Reverted https://github.com/pytorch/pytorch/pull/121642 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121642#issuecomment-1996121710))
2024-03-13 23:51:20 +00:00
3e02a7efcd Only FA2 doesn't support attn-mask (#121825)
Fixes #121783

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121825
Approved by: https://github.com/cpuhrsch
2024-03-13 23:03:39 +00:00
a8dcbf2749 [export] allow Dim(1,2) for export dynamic shapes (#121642)
Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642
Approved by: https://github.com/avikchaudhuri
2024-03-13 22:59:07 +00:00
70c6f542f2 Revert "[dynamo] Convert invalid args into graph breaks (#121784)"
This reverts commit 0df39480f6a74c9094555e8a61a8c8bb01716d4e.

Reverted https://github.com/pytorch/pytorch/pull/121784 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks ONNX test in trunk 0c1ac4484d ([comment](https://github.com/pytorch/pytorch/pull/121784#issuecomment-1995979435))
2024-03-13 22:12:43 +00:00
aaff8d274a CUDA fast path for _chunk_cat() (#120678)
This PR provides CUDA fast path implementation for ATen Op `_chunk_cat` (#121081).

Performance on a production benchmark:

- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132

Unit: Billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](7b3febdca7/torch/distributed/_composable/fsdp/_fsdp_collectives.py (L176))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang
2024-03-13 22:02:06 +00:00
c53e3f57b5 allow fp16 in quant/dequant decompositions (#121738)
Test Plan:
```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp16 --pt2e_quantize "xnnpack_dynamic" -2
```

Reviewed By: kirklandsign

Differential Revision: D54785950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121738
Approved by: https://github.com/jerryzh168
2024-03-13 21:45:08 +00:00
c7193f4099 [DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479)
This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict.

This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce.

**TODOs**

- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [x] Add unit tests to ensure the fusion doesn't DDP + TP.
- [ ] Group different PG and data type of all_reduces.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479
Approved by: https://github.com/wz337
ghstack dependencies: #113209
2024-03-13 21:41:22 +00:00
dd568f4207 [Export, AOTInductor] Populate ShapeEnv's var_to_val during deserialization (#121759)
Summary:
Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor is relying on this field to compile dynamic shape properly.
As a result, when AOTI failed at compiling a deserialized ExportedProgram.

Test Plan: buck2 test  mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Differential Revision: D54559494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759
Approved by: https://github.com/avikchaudhuri
2024-03-13 21:28:25 +00:00
a2a4693c1b Revert "Init CUDA instead of faking memory stats (#121698)"
This reverts commit 2460f0b1c7bb6e088aca1f6e9bb62c834053d71b.

Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
45a835cef2 Revert "[compiled autograd] free stack objects before calling compiled graph (#121707)"
This reverts commit 5b90074540577267c29f5f784be123ee54f6491d.

Reverted https://github.com/pytorch/pytorch/pull/121707 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
8b1b61bc70 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires user to audit their code and opt-in their custom autograd::Function via autograd::Function::is_traceable and maybe additional compiled_args + apply_with_saved implementation. this was the only way I can think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.

Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-13 21:13:21 +00:00
58ff55aac5 Add support for tt.scan to triton kernel mutation analysis (#121828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121828
Approved by: https://github.com/aakhundov, https://github.com/Skylion007
2024-03-13 20:37:56 +00:00
8e6d572b4e [DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209)
Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/)

**TL;DR**
This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that Python reducer is being used, DDP will automatically turn on the fusion.

This PR does not invent any algorithm and simply reflects the bucket size users set to DDP.

**Implementation Details**
*Fusion with concat op*
The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping will also be perform to get the individual gradient.

Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we could perform the gradient scaling over the the concatenated buffer.

*Fusion with `all_reduce_coalesced`*
The idea of this fusion is to use `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoid the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we could not do one simple gradient scaling but have to rely on `foreach_div` to help the gradient scaling.

**Limitations**
Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs.

**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [ ] Add unit tests to ensure the fusion doesn't DDP + TP.
- [ ] Group different PG and data type of `all_reduce`s.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209
Approved by: https://github.com/yf225
2024-03-13 20:37:09 +00:00
0c1ac4484d Support call_method in DDPOptimizer (#121771)
This PR fixes Issue #111279.

While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
    def __init__(self,):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear.forward(x) # Error
        # return self.linear(x) # OK
```

Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.

This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
2024-03-13 20:03:15 +00:00
0df39480f6 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
ghstack dependencies: #121615, #121616
2024-03-13 20:02:33 +00:00
5b90074540 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved compilation code into _compiled_autograd_impl, frees stack allocated objects e.g. AutogradCompilerCall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
ghstack dependencies: #121698
2024-03-13 19:31:44 +00:00
2460f0b1c7 Init CUDA instead of faking memory stats (#121698)
This is very confusing when checking for memory usage and allocations are only happening using C API. We should change it to a warning/error or just init cuda. Codepaths that run on non-CUDA environments shouldn't call into these functions in the first place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
2024-03-13 19:31:44 +00:00
cd949d133e Support setUpClass & tearDownClass with instantiate_device_type_tests() (#121686)
Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass test classes used with instantiate_device_type_tests().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-03-13 18:28:42 +00:00
ffabb25c48 Count the number of entries directly in avg_pool2d lowering (#121429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121429
Approved by: https://github.com/peterbell10
ghstack dependencies: #116085
2024-03-13 18:19:47 +00:00
a19a05fd1d Add lowering for avg_pool{1, 3}d (#116085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116085
Approved by: https://github.com/peterbell10
2024-03-13 18:19:47 +00:00
79fac48bb3 Use pytorch bot's labeler (#121762)
Change corresponds to https://github.com/pytorch/test-infra/pull/4995
Testing (very light) in https://github.com/malfet/deleteme/pull/81
Should help with https://github.com/pytorch/test-infra/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121762
Approved by: https://github.com/huydhn
2024-03-13 17:16:49 +00:00
05df03ec1b Allow custom attributes for torch function subclasses (#121693)
Added custom attribute access with test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121693
Approved by: https://github.com/anijain2305
2024-03-13 17:01:57 +00:00
92a2b214f8 Make translation validation more user friendly (#120880)
Two main changes:

- Don't rethrow the exception when we fail in TV, just throw the entire
  thing and trust the user will inspect logs / backtrace to see we
  failed in TV

- Don't add an event to the TV logs until we've confirmed that the event
  actually runs without erroring.  This prevents us from putting events
  that e.g., fail because the guard on data dependent size, and the
  failing in TV.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880
Approved by: https://github.com/lezcano, https://github.com/ysiraichi
2024-03-13 15:21:59 +00:00
b1d5998956 Upgrade to tlparse 0.3.7 (#121772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121772
Approved by: https://github.com/Skylion007
2024-03-13 15:21:20 +00:00
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
559ca13b3f [dynamo] Refactor TorchInGraphFunctionVariable for compile time (#121616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121616
Approved by: https://github.com/oulgen
ghstack dependencies: #121615
2024-03-13 14:21:21 +00:00
51cf57c6c6 Revert "Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)"
This reverts commit 5fd7f5c4e336c2c3041e10529990c620cc8cf9a5.

Reverted https://github.com/pytorch/pytorch/pull/120719 on behalf of https://github.com/janeyx99 due to sorry but am reverting as this prints unwanted warnings even when an exception is not thrown  ([comment](https://github.com/pytorch/pytorch/pull/120719#issuecomment-1994491826))
2024-03-13 14:09:38 +00:00
a157a0d00d [constraints] Fix scalar type for constraint_range to Long (#121752)
Differential Revision: [D54822125](https://our.internmc.facebook.com/intern/diff/D54822125)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121752
Approved by: https://github.com/ezyang
2024-03-13 11:11:09 +00:00
7fe0cc53e9 make _process_dynamic_shapes an implementation detail (#121713)
Summary: `_process_dynamic_shapes` converts new dynamic shapes to old constraints, but in the future may not need to do so. Preparing for that future.

Test Plan: CI

Differential Revision: D54780374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121713
Approved by: https://github.com/tugsbayasgalan
2024-03-13 08:33:00 +00:00
5088e4956e Add quantized conv transpose2d op (#120151)
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note the linear quantized are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

Differential Revision: D52344261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120151
Approved by: https://github.com/yipjustin
2024-03-13 08:09:57 +00:00
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to backout this diff.
D54778906 seems to be causing test failures for APF blocking trunk health and hence release. Just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
be33d31ae2 add std::ostream& operator<< for BFloat16 in BFloat16.h (#121302)
This PR Move `operator<<` of `BFloat16` to `BFloat16.h`.

Previously, this function is in `TensorDataContainer.h`. If need `std::cout` a `BFloat16` variable when debugging, `TensorDataContainer.h` have to be included. This is inconvient and counterintuitive.

Other dtypes such as `Half`, define their `operator<<` in headers where they are defined such as `Half.h`. Therefore, I think it makes more sense to move `operator<<` of `BFloat16` to `BFloat16.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302
Approved by: https://github.com/ezyang
2024-03-13 06:47:34 +00:00
5986552ebe [nit][DCP][DSD] Remove variables not being used in test_state_dict.py #121204 (#121773)
Replacing https://github.com/pytorch/pytorch/pull/121204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121773
Approved by: https://github.com/Skylion007
2024-03-13 06:35:04 +00:00
da2a9a0512 _foreach_copy with different src/dst dtypes (#121717)
Fixes #115171

```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          14.2        |          12.6        |           12.7
      num_tensors: 256   |         688.0        |         510.3        |          514.0
      num_tensors: 1024  |        2768.0        |        2053.3        |         2047.7

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.9        |            8.8
      num_tensors: 256   |         497.6        |         344.3        |          348.3
      num_tensors: 1024  |        1991.9        |        1392.0        |         1389.0

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.8        |            8.8
      num_tensors: 256   |         497.5        |         344.5        |          348.0
      num_tensors: 1024  |        1993.2        |        1390.4        |         1387.5

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          19.0        |          17.9        |           18.1
      num_tensors: 256   |         707.2        |         540.2        |          543.1
      num_tensors: 1024  |        2900.6        |        2156.6        |         2159.2

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.8        |          13.7        |           13.1
      num_tensors: 256   |         513.2        |         352.6        |          350.4
      num_tensors: 1024  |        2047.6        |        1404.4        |         1400.4

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.6        |          12.8        |           14.2
      num_tensors: 256   |         511.9        |         351.8        |          350.6
      num_tensors: 1024  |        2045.4        |        1402.2        |         1401.4

Times are in microseconds (us).

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99
2024-03-13 05:42:28 +00:00
a13dd92d88 [dynamo] Minor compile time optimizations in torch.py (#121615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121615
Approved by: https://github.com/oulgen
2024-03-13 05:36:22 +00:00
d619be57c0 [executorch hash update] update the pinned executorch hash (#121056)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121056
Approved by: https://github.com/pytorchbot
2024-03-13 04:54:16 +00:00
0c1d59b72f CI: Fix flaky artifact upload step (#121733)
This PR changes the upload artifact step of the wheels and conda build to write
each matrix entry to a different file. This is because updating the same file
from multiple jobs can be flaky as is warned in the docs for upload-artifact

> Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733
Approved by: https://github.com/huydhn
ghstack dependencies: #121268
2024-03-13 04:42:52 +00:00
52ed35bb64 [inductor] Update triton pin (#121268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121268
Approved by: https://github.com/oulgen, https://github.com/malfet
2024-03-13 04:42:52 +00:00
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
f83392b677 cublasLt workspace warning info is misleading, the unit of measuremen… (#121073)
cublasLt workspace warning info is misleading, the unit of measurement should be KiB instead of bytes

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121073
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-03-13 03:37:40 +00:00
e755dab0d1 [ROCm] Enable several test_unary_ufuncs UTs on ROCm (#121104)
Enabled:
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atanh_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atanh_cuda_complex128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121104
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-03-13 03:34:22 +00:00
f24ae66abf [AOTInductor] Skip tests on RoCM for duplicate_constant_folding (#121750)
Summary: Skip AMD tests for duplicated kernels in constant folding

Test Plan: Diff is test

Differential Revision: D54820804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121750
Approved by: https://github.com/huydhn
2024-03-13 03:21:21 +00:00
9f235971f0 Gate tt.reduce Triton mutation tests on Triton version (#121753)
Summary: The goal is to make the `test_argmax` and `test_reduce_sum` to work both before and after https://github.com/openai/triton/pull/3191 is included into the Triton pin. This is important to make those tests work during the Triton pin update process both in OSS and internally.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax
..
----------------------------------------------------------------------
Ran 2 tests in 1.906s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753
Approved by: https://github.com/Skylion007
2024-03-13 01:43:02 +00:00
7d05c4c093 Remove error anti-pattern when dealing with dynamic shape output (#121681)
There are cases where capture_dynamic_output_shape_ops=True and we will still see DynamicOutputShapeException. For example, when an op doesn't have a meta kernel implemented to return the correct dynamic shape output. If we blindly give users instructions to set capture_dynamic_output_shape_ops to True, users would try it and see no change. As witnessed in this issue:
https://github.com/pytorch/pytorch/issues/121036#issuecomment-1985221435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121681
Approved by: https://github.com/tugsbayasgalan
2024-03-13 00:45:23 +00:00
9df0dca7f6 Revert "[ Inductor ] Shape padding honors output stride preservation (#120797)"
This reverts commit 57fc35a3af09f7657b2be593a1046f0ac2dd50ab.

Reverted https://github.com/pytorch/pytorch/pull/120797 on behalf of https://github.com/williamwen42 due to perf regression on dashboard ([comment](https://github.com/pytorch/pytorch/pull/120797#issuecomment-1992857428))
2024-03-13 00:43:34 +00:00
02bb2180f4 [torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449)
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, logging level to debug, it shows that the the changed code is consistently ard 20 secs faster (~90s vs originally ~110s)

Test Plan:
test passed, see summary

compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347

Differential Revision: D54237017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
2024-03-13 00:19:05 +00:00
7a53dedb07 CI: Specify libc and libstdcxx versions in conda environments (#121556)
Without this we get mismatches between the GLIBC and GLIBCXX ABI used
by conda packages vs pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556
Approved by: https://github.com/isuruf, https://github.com/malfet
2024-03-13 00:12:54 +00:00
68be750e17 Cleanup some exception handling in triton mutation tracking (#121739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121739
Approved by: https://github.com/Skylion007
ghstack dependencies: #121690
2024-03-13 00:02:36 +00:00
a9274c9a2c Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
This PR corrects the example in the AOTInductor example which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
   21 |     std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
      |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
2024-03-12 23:43:40 +00:00
79ee6bbde3 Support triton.language.dtype with torch.compile (#121690)
Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means that we need to "rewrite" them for fx graph and inductor execution.

This PR allows Mamba2 to work with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
2024-03-12 23:21:46 +00:00
22bb24986d [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-12 22:48:48 +00:00
519151a062 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
partitioner generates different graph in recompilation on each run
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/izaitsevfb
2024-03-12 22:18:43 +00:00
a95ceb51a2 Release fix pinning slow-tests.json (#121746)
Apply release changes script adds version to SLOW_TESTS_FILE which should not change

Test:
```
SLOW_VER=test
sed -i -e s#/slow-tests.json#"/slow-tests.json?versionId=${SLOW_VER}"#  tools/stats/import_test_stats.py
```
Output:
```
SLOW_TESTS_FILE = ".pytorch-slow-tests.json"
...
url = "https://ossci-metrics.s3.amazonaws.com/slow-tests.json?versionId=test"
```

related to: https://github.com/pytorch/pytorch/pull/121726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121746
Approved by: https://github.com/huydhn
2024-03-12 22:04:55 +00:00
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
844bfbbd2e feat: Update Dockerfile default versions for Python, OS, and CUDA arch list (#121560)
- Update Dockerfile default versions for Python, OS, and CUDA arch list
	- Python 3.8 is EOL later this year, the `docker.Makefile` has 3.10 as default
	- `docker.Makefile` is using 22.04 so this just aligns that
	- The GPU feature list is quite dated, most of those architectures are long past EOL and we aren't getting the newer cards (A100, H100) into that list until now https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121560
Approved by: https://github.com/seemethere, https://github.com/Neilblaze, https://github.com/atalman, https://github.com/malfet
2024-03-12 21:43:26 +00:00
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
5478a4e348 Don't run non-strict for test case that doesn't need non-strict (#121710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #121652, #121678, #121687
2024-03-12 21:32:33 +00:00
5b506c8bce Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388)"
This reverts commit 04a5d6e8d3f09ee6741484bcfea022228f747b09.

Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))
2024-03-12 21:31:18 +00:00
522d972924 [eazy] add more log when accuracy check fail (#121656)
Add these log to debug the regress of accuracy test for dm_nfnet_f0 model for training.

With these extra log when the accuracy check fail, we can verify if it's close to succeed or not. If yes that indicates there is no real issue but just flaky and we probably can tune the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-12 20:58:20 +00:00
f50c652422 avoid aten dispatch shadowing type with variable (#121659)
Summary:
`DECLARE_DISPATCH` is shadowing the variable data with the data type:
`extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub` for instance.
This is probably dangerous behavior to rely on, as the compiler needs to always resolve to type and/or data based on context. Previous macro fails with VS2022.

Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource`

Differential Revision: D54699849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659
Approved by: https://github.com/albanD
2024-03-12 20:50:47 +00:00
6d8a7d6e58 [pytorch] optional zero points on dequantize per channel (#121724)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2364

bypass-github-export-checks

Test Plan: sandcastle

Reviewed By: mikekgfb

Differential Revision: D54709217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724
Approved by: https://github.com/mikekgfb
2024-03-12 19:54:11 +00:00
a6149eba12 [easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
Summary:
# Why?

Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list` which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`) and `immutable_list` isn't explictly supported here.

Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.

Test Plan: ci

Differential Revision: D54764829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
2024-03-12 19:27:56 +00:00
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixed the bug of redistribute to move early return check into the
redistribute autograd function, so that even though we redistribute the
same placement, the grad_placements from the `to_local` call might be
different, the redistribute backward still need to happen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
00a53b58dd Refactor release only changes to two step execution (#121728)
Refactor release only changes to two step execution.

1. Step ``tag-docker-images.sh`` . Tags latest docker images for current release. This step takes about 30min to complete. This step may fail due to space issues on the local host or http connection when pulling image. Hence should be rerun if failed.

2. Apply release only changes ``apply-release-changes.sh`` prepares a PR with release only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
2024-03-12 17:22:22 +00:00
4e63d9065a [dynamo] Delete record replay tests as they are not maintained (#121705)
Fixes https://github.com/pytorch/pytorch/issues/115518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705
Approved by: https://github.com/mlazos
2024-03-12 17:16:34 +00:00
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
22489bfe70 [dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622
Approved by: https://github.com/jansel
ghstack dependencies: #121614
2024-03-12 17:09:11 +00:00
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec5fb9bccb49bf4ee4ab561e872c0f50.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
b84f94f6a3 Restore timestamps on C++ logs without glog (#121384)
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it no portable. No Windows nanoseconds support because I'm lazy.

I tested by running `build/bin/TCPStoreTest` and observing the log messages there.  I am actually not sure how to look at the log messages from Python though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-03-12 17:01:32 +00:00
704e15307e [caffe2] replace refernces to np.asscalar (#121332) (#121545)
Summary:

`np.asscalar` was deprecated and removed in a recent Numpy. It used to be implemented the following way, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.

Test Plan: visual inspection and automated tests

Differential Revision: D54697760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
2024-03-12 16:58:47 +00:00
d1715c3adb [export] Update error message for set_grad (#121666)
Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666
Approved by: https://github.com/ydwu4
2024-03-12 16:41:45 +00:00
3c8c7e2a46 [dynamo] Tweak naming for module hook bw_state (#121609)
Some minor changes not related to the other PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609
Approved by: https://github.com/yanboliang
2024-03-12 16:27:56 +00:00
7a68e0a3e8 [DCP][state_dict] Remove the check of FSDP has root (#121544)
Root may not exist due to FSDP lazy initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544
Approved by: https://github.com/Skylion007
ghstack dependencies: #121273, #121276, #121290
2024-03-12 15:43:19 +00:00
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
cc51e100f5 [ET-VK] Enable Dynamic shape support via tensor virtual and physical resizing (#121598)
Summary:
## Context

This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate via allowing Tensors to be resized in one of two ways:

1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed.

Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.

Differential Revision: D54728401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
2024-03-12 14:32:00 +00:00
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
edf22f3a48 Modify signature of dequantize ops for decomposed quantized Tensor (#119173) (#121450)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308

Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any.

At present, dequantize op for decomposed quantized Tensor representation e.g. dequantize_per_tensor() assumes the output dtype as torch.float and hence, it does not have the output dtype in its operator argument list. However, this op signature becomes unusable when the assumption breaks. Because, in case the output dtype is different from torch.float, there is no way to specify the same during dequantization.

This change is aimed at generalizing the signature of dequantize op like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestion and feedback regarding any better alternative that can be used instead.

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel

Reviewed By: digantdesai

Differential Revision: D53590486

Pulled By: manuelcandales

Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
2024-03-12 12:36:31 +00:00
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
78b4793c96 [dynamo][compile-time] Caching VTs to reduce compile-time (#121031)
Reduces the `torch.compile(backend="eager")` for this code

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

From 18 seconds to 12 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
2024-03-12 09:19:50 +00:00
52ad2b682c Generate predispatch tests (#121678)
In this PR, we create another dynamic test class for TestExport tests that basically serializes/deserializas pre-dispatch IR. I encountered 4 additional failures. But 3 of them are due to different operator showing up in the graph and only one legit failure which is tracked by another task internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
2024-03-12 08:34:50 +00:00
656134c38f [ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504)
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases  (m or n or k = 0)

CUSPARSE_SPMM_COMPLEX128_SUPPORTED also used for `test_addmm_all_sparse_csr` and ` test_sparse_matmul` and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-03-12 07:29:57 +00:00
4373 changed files with 189526 additions and 368958 deletions

View File

@ -1,3 +1,4 @@
# We do not use this library in our Bazel build. It contains an
# infinitely recursing symlink that makes Bazel very unhappy.
third_party/ittapi/
third_party/opentelemetry-cpp

View File

@ -204,7 +204,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.7
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -215,7 +215,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.0
ROCM_VERSION=6.1
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -229,6 +229,7 @@ case "$image" in
BASEKIT_VERSION=2024.0.0-49522
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
@ -305,6 +306,12 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
*)
# Catch-all for builds that are not hardcoded.
@ -359,7 +366,7 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
DOCKER_BUILDKIT=1 docker build \
docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
@ -398,6 +405,8 @@ DOCKER_BUILDKIT=1 docker build \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
--build-arg "SKIP_LLVM_SRC_BUILD_INSTALL=${SKIP_LLVM_SRC_BUILD_INSTALL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1 +1 @@
e2a8f9548aecb62a68e264607174a7d207ed2929
d4b3e5cc607e97afdba79dc90f8ef968142f347c

View File

@ -1 +1 @@
0a22a91d04c2b4a029a69a198eac390089c3e891
bbe6246e37d8aa791c67daaf9d9d61b26c9ccfdc

View File

@ -0,0 +1 @@
b8c64f64c18d8cac598b3adb355c21e7439c21de

View File

@ -1 +1 @@
a9bc1a36470eefafe0e2ab2503b8698f1e89e7e3
45fff310c891f5a92d55445adf8cc9d29df5841e

View File

@ -113,7 +113,6 @@ install_centos() {
glibc-devel \
glibc-headers \
glog-devel \
hiredis-devel \
libstdc++-devel \
libsndfile-devel \
make \

View File

@ -57,8 +57,21 @@ fi
# Uncomment the below when resolved to track the latest conda update
# as_jenkins conda update -y -n base conda
if [[ $(uname -m) == "aarch64" ]]; then
export SYSROOT_DEP="sysroot_linux-aarch64=2.17"
else
export SYSROOT_DEP="sysroot_linux-64=2.17"
fi
# Install correct Python version
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
# Also ensure sysroot is using a modern GLIBC to match system compilers
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\
python="$ANACONDA_PYTHON_VERSION" \
${SYSROOT_DEP}
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
@ -110,14 +123,5 @@ fi
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -4,11 +4,6 @@ set -ex
install_ubuntu() {
apt-get update
apt-get install -y --no-install-recommends \
libhiredis-dev \
libleveldb-dev \
liblmdb-dev \
libsnappy-dev
# Cleanup
apt-get autoclean && apt-get clean
@ -20,12 +15,6 @@ install_centos() {
# See http://fedoraproject.org/wiki/EPEL
yum --enablerepo=extras install -y epel-release
yum install -y \
hiredis-devel \
leveldb-devel \
lmdb-devel \
snappy-devel
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -33,12 +33,12 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240301 --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -11,7 +11,8 @@ mkdir -p $pb_dir
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd

View File

@ -61,6 +61,10 @@ install_ubuntu() {
rocprofiler-dev \
roctracer-dev
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev
fi
# precompiled miopen kernels added in ROCm 3.5, renamed in ROCm 5.5
# search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails

View File

@ -13,8 +13,11 @@ conda_reinstall() {
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/ROCmSoftwarePlatform/triton"
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${BASEKIT_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton"

View File

@ -3,7 +3,7 @@ set -xe
# Intel® software for general purpose GPU capabilities.
# Refer to https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html
# Refer to https://dgpu-docs.intel.com/releases/LTS_803.29_20240131.html
# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.
# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
@ -21,7 +21,7 @@ function install_ubuntu() {
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/production/2328 unified" \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list

View File

@ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.8.0
mypy==1.9.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.8.0
#Pinned versions: 1.9.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -134,9 +134,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.9.1
optree==0.11.0
#Description: A library for tree manipulation
#Pinned versions: 0.9.1
#Pinned versions: 0.11.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
@ -147,9 +147,9 @@ optree==0.9.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.2.0
pillow==10.3.0
#Description: Python Imaging Library fork
#Pinned versions: 10.2.0
#Pinned versions: 10.3.0
#test that import:
protobuf==3.20.2
@ -228,12 +228,11 @@ scikit-image==0.20.0 ; python_version >= "3.10"
#Pinned versions: 0.20.3
#test that import:
scipy==1.6.3 ; python_version < "3.10"
scipy==1.8.1 ; python_version == "3.10"
scipy==1.10.1 ; python_version == "3.11"
scipy==1.10.1 ; python_version <= "3.11"
scipy==1.12.0 ; python_version == "3.12"
# Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python
#Pinned versions: 1.6.3
#Pinned versions: 1.10.1
#test that import: test_unary_ufuncs.py, test_torch.py,test_tensor_creation_ops.py
#test_spectral_ops.py, test_sparse_csr.py, test_reductions.py,test_nn.py
#test_linalg.py, test_binary_ufuncs.py
@ -264,10 +263,10 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
#wheel not found on aarch64, and source build requires rust
lintrunner==0.10.7 ; platform_machine == "x86_64"
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
#Description: all about linters!
#Pinned versions: 0.10.7
#Pinned versions: 0.12.5
#test that import:
rockset==1.0.3
@ -280,9 +279,9 @@ ghstack==0.8.0
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.3
jinja2==3.1.4
#Description: jinja2 template engine
#Pinned versions: 3.1.3
#Pinned versions: 3.1.4
#test that import:
pytest-cpp==2.3.0
@ -311,3 +310,5 @@ lxml==5.0.0.
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0

View File

@ -61,15 +61,20 @@ COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# Install XPU Dependencies
ARG BASEKIT_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
# TODO: will add triton xpu commit
COPY ci_commit_pins/triton.txt triton.txt
COPY ci_commit_pins/triton-xpu.txt triton-xpu.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
@ -85,11 +90,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install XPU Dependencies
ARG BASEKIT_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh

View File

@ -169,9 +169,11 @@ RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
# Install ccache/sccache (do this last, so we get priority in PATH)
ARG SKIP_SCCACHE_INSTALL
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
RUN if [ -z "${SKIP_SCCACHE_INSTALL}" ]; then bash ./install_cache.sh; fi
RUN rm install_cache.sh
# Add jni.h for java host build
COPY ./common/install_jni.sh install_jni.sh
@ -188,7 +190,9 @@ ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
ARG SKIP_LLVM_SRC_BUILD_INSTALL
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
RUN if [ -n "${SKIP_LLVM_SRC_BUILD_INSTALL}" ]; then set -eu; rm -rf /opt/llvm; fi
# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell

View File

@ -1,5 +1,9 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/../pytorch/common_utils.sh"
LOCAL_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
ROOT_DIR=$(cd "$LOCAL_DIR"/../.. && pwd)
TEST_DIR="$ROOT_DIR/test"

View File

@ -3,6 +3,20 @@
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
echo "sudo may print the following warning message that can be ignored. The chown command will still run."
echo " sudo: setrlimit(RLIMIT_STACK): Operation not permitted"
echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace
}
# Disable shellcheck SC2064 as we want to parse the original owner immediately.
# shellcheck disable=SC2064
trap_add cleanup_workspace EXIT
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# TODO: This can be removed later once vision is also part of the Docker image
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"

View File

@ -81,7 +81,22 @@ if ! which conda; then
export USE_MKLDNN=0
fi
else
export CMAKE_PREFIX_PATH=/opt/conda
# CMAKE_PREFIX_PATH precedences
# 1. $CONDA_PREFIX, if defined. This follows the pytorch official build instructions.
# 2. /opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}, if ANACONDA_PYTHON_VERSION defined.
# This is for CI, which defines ANACONDA_PYTHON_VERSION but not CONDA_PREFIX.
# 3. $(conda info --base). The fallback value of pytorch official build
# instructions actually refers to this.
# Commonly this is /opt/conda/
if [[ -v CONDA_PREFIX ]]; then
export CMAKE_PREFIX_PATH=${CONDA_PREFIX}
elif [[ -v ANACONDA_PYTHON_VERSION ]]; then
export CMAKE_PREFIX_PATH="/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}"
else
# already checked by `! which conda`
CMAKE_PREFIX_PATH="$(conda info --base)"
export CMAKE_PREFIX_PATH
fi
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
@ -223,6 +238,24 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
echo "sudo may print the following warning message that can be ignored. The chown command will still run."
echo " sudo: setrlimit(RLIMIT_STACK): Operation not permitted"
echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace
}
# Disable shellcheck SC2064 as we want to parse the original owner immediately.
# shellcheck disable=SC2064
trap_add cleanup_workspace EXIT
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
@ -248,13 +281,17 @@ else
( ! get_exit_code python setup.py clean bad_argument )
if [[ "$BUILD_ENVIRONMENT" != *libtorch* ]]; then
# rocm builds fail when WERROR=1
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel
@ -354,4 +391,8 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
python tools/stats/export_test_times.py
fi
print_sccache_stats
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then
print_sccache_stats
fi

View File

@ -159,7 +159,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.5"
pip_install --user "tlparse==0.3.7"
PATH="$(python -m site --user-base)/bin:$PATH"
}

View File

@ -45,6 +45,10 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx

View File

@ -59,16 +59,16 @@ print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
if math.isnan(sample_mean):
raise Exception("""Error: sample mean is NaN""")
raise Exception("""Error: sample mean is NaN""") # noqa: TRY002
elif math.isnan(sample_sigma):
raise Exception("""Error: sample sigma is NaN""")
raise Exception("""Error: sample sigma is NaN""") # noqa: TRY002
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
raise Exception(
raise Exception( # noqa: TRY002
f"""\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run

View File

@ -26,8 +26,8 @@ echo "error: python_doc_push_script.sh: version (arg2) not specified"
fi
# Argument 1: Where to copy the built documentation to
# (pytorch.github.io/$install_path)
install_path="${1:-${DOCS_INSTALL_PATH:-docs/${DOCS_VERSION}}}"
# (pytorch_docs/$install_path)
install_path="${1:-${DOCS_INSTALL_PATH:-${DOCS_VERSION}}}"
if [ -z "$install_path" ]; then
echo "error: python_doc_push_script.sh: install_path (arg1) not specified"
exit 1
@ -68,8 +68,8 @@ build_docs () {
}
git clone https://github.com/pytorch/pytorch.github.io -b "$branch" --depth 1
pushd pytorch.github.io
git clone https://github.com/pytorch/docs pytorch_docs -b "$branch" --depth 1
pushd pytorch_docs
export LC_ALL=C
export PATH=/opt/conda/bin:$PATH
@ -105,6 +105,7 @@ if [ "$is_main_doc" = true ]; then
echo undocumented objects found:
cat build/coverage/python.txt
echo "Make sure you've updated relevant .rsts in docs/source!"
echo "You can reproduce locally by running 'cd docs && make coverage && cat build/coverage/python.txt'"
exit 1
fi
else

View File

@ -6,6 +6,27 @@
set -ex
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
echo "sudo may print the following warning message that can be ignored. The chown command will still run."
echo " sudo: setrlimit(RLIMIT_STACK): Operation not permitted"
echo "For more details refer to https://github.com/sudo-project/sudo/issues/42"
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace
}
# Disable shellcheck SC2064 as we want to parse the original owner immediately.
# shellcheck disable=SC2064
trap_add cleanup_workspace EXIT
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
fi
echo "Environment variables:"
env
@ -90,9 +111,6 @@ if [[ -n $TESTS_TO_INCLUDE ]]; then
INCLUDE_CLAUSE="--include $TESTS_TO_INCLUDE"
fi
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Environment variables"
env
@ -163,6 +181,11 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
# TODO: revisit this once the CI is stabilized on aarch64 linux
export VALGRIND=OFF
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
@ -211,8 +234,6 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so
# Disable valgrind for asan
export VALGRIND=OFF
# Increase stack size, because ASAN red zones use more stack
ulimit -s 81920
(cd test && python -c "import torch; print(torch.__version__, torch.version.git_version)")
echo "The next four invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured"
@ -289,19 +310,23 @@ test_dynamo_shard() {
test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/test_c10d_functional_native.py
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype
python test/run_test.py -i inductor/test_torchinductor.py -k test_multi_gpu --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_non_default_cuda_device --verbose
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -313,13 +338,13 @@ test_inductor() {
python tools/dynamo/verify_dynamo.py
python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose
# Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state
python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo --verbose
python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
fi
}
@ -432,6 +457,17 @@ test_perf_for_dashboard() {
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"
fi
done
done
}
@ -486,6 +522,11 @@ test_single_dynamo_benchmark() {
fi
}
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-micro-reports
python benchmarks/gpt_fast/benchmark.py
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -593,6 +634,12 @@ test_inductor_torchbench_cpu_smoketest_perf(){
done
}
test_torchbench_gcp_smoketest(){
pushd "${TORCHBENCHPATH}"
python test.py -v
popd
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
@ -1116,11 +1163,33 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
inductor/test_config inductor/test_control_flow inductor/test_coordinate_descent_tuner inductor/test_fx_fusion \
inductor/test_group_batch_fusion inductor/test_inductor_freezing inductor/test_inductor_utils \
inductor/test_inplacing_pass inductor/test_kernel_benchmark inductor/test_layout_optim \
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "${TEST_CONFIG}" == *backward* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
# Do NOT add tests after bc check tests, see its comment.
elif [[ "${TEST_CONFIG}" == *xla* ]]; then
@ -1145,6 +1214,8 @@ elif [[ "$TEST_CONFIG" == deploy ]]; then
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
install_torchvision
id=$((SHARD_NUMBER-1))
@ -1172,6 +1243,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
TORCHBENCHPATH=$(pwd)/torchbench test_torchbench_gcp_smoketest
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any

View File

@ -17,22 +17,22 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers
call %INSTALLER_DIR%\install_magma.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call %INSTALLER_DIR%\install_sccache.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Miniconda has been installed as part of the Windows AMI with all the dependencies.
:: We just need to activate it here
call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Override VS env here
pushd .
@ -41,8 +41,8 @@ if "%VC_VERSION%" == "" (
) else (
call "C:\Program Files (x86)\Microsoft Visual Studio\%VC_YEAR%\%VC_PRODUCT%\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%VC_VERSION%
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
@echo on
popd
@ -52,12 +52,12 @@ set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION%
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
goto fail
)
rem version transformer, for example 10.1 to 10_1.
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
goto fail
)
set VERSION_SUFFIX=%CUDA_VERSION:.=_%
set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
@ -101,8 +101,8 @@ if "%USE_CUDA%"=="1" (
:: CMake requires a single command as CUDA_NVCC_EXECUTABLE, so we push the wrappers
:: randomtemp.exe and sccache.exe into a batch file which CMake invokes.
curl -kL https://github.com/peterjc123/randomtemp-rust/releases/download/v0.4/randomtemp.exe --output %TMP_DIR_WIN%\bin\randomtemp.exe
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
echo @"%TMP_DIR_WIN%\bin\randomtemp.exe" "%TMP_DIR_WIN%\bin\sccache.exe" "%CUDA_PATH%\bin\nvcc.exe" %%* > "%TMP_DIR%/bin/nvcc.bat"
cat %TMP_DIR%/bin/nvcc.bat
set CUDA_NVCC_EXECUTABLE=%TMP_DIR%/bin/nvcc.bat
@ -114,8 +114,8 @@ if "%USE_CUDA%"=="1" (
set
python setup.py bdist_wheel
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
sccache --show-stats
python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"
(
@ -135,3 +135,8 @@ python -c "import os, glob; os.system('python -mpip install --no-index --no-deps
sccache --show-stats --stats-format json | jq .stats > sccache-stats-%BUILD_ENVIRONMENT%-%OUR_GITHUB_JOB_ID%.json
sccache --stop-server
exit /b 0
:fail
exit /b 1

View File

@ -36,6 +36,7 @@ hicpp-exception-baseclass,
hicpp-avoid-goto,
misc-*,
-misc-const-correctness,
-misc-include-cleaner,
-misc-use-anonymous-namespace,
-misc-unused-parameters,
-misc-no-recursion,

View File

@ -54,6 +54,7 @@ per-file-ignores =
torch/ao/quantization/fx/_decomposed.py: TOR901
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
torch/distributed/_tensor/_collective_utils.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

1
.gitattributes vendored
View File

@ -4,3 +4,4 @@
.github/generated-* linguist-generated=true
.github/scripts/gql_mocks.json linguist-generated=true
third_party/LICENSES_BUNDLED.txt linguist-generated=true
tools/build/bazel/requirements.txt linguist-generated=true

View File

@ -8,7 +8,18 @@ body:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
existing and past issues](https://github.com/pytorch/pytorch/issues)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/master/dynamo/index.html)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)
Note: if you're submitting an issue that you generated from a fuzzer. Please do the following:
- Ensure rtol/atol are at default tolerances
- Dont compare indices of max/min etc, because that avoids the above requirement
- If comparing eager and torch.compile at fp16/bf16, you should use fp32 as baseline
If the above requirements are met, add the label "topic: fuzzer" to your issue.
- type: textarea
attributes:
label: 🐛 Describe the bug
@ -33,7 +44,7 @@ body:
label: Minified repro
description: |
Please run the minifier on your example and paste the minified code below
Learn more here https://pytorch.org/docs/master/compile/troubleshooting.html
Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
placeholder: |
env TORCHDYNAMO_REPRO_AFTER="aot" python your_model.py
or

View File

@ -21,6 +21,7 @@ self-hosted-runner:
- linux.rocm.gpu
- macos-m1-stable
- macos-m1-13
- macos-m1-14
- macos-12-xl
- macos-12
- macos12.3-m1

View File

@ -9,6 +9,10 @@ inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -18,9 +22,10 @@ runs:
uses: seemethere/download-artifact-s3@v4
with:
name: ${{ inputs.name }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download PyTorch Build Artifacts from GHA
if: inputs.use-gha
if: ${{ inputs.use-gha }}
uses: actions/download-artifact@v3
with:
name: ${{ inputs.name }}
@ -29,6 +34,10 @@ runs:
shell: bash
run: unzip -o artifacts.zip
- name: Remove artifacts.zip
shell: bash
run: rm artifacts.zip
- name: Output disk space left
shell: bash
run: df -H

View File

@ -13,6 +13,13 @@ inputs:
required: true
type: string
description: JSON description of what test configs to run.
selected-test-configs:
required: false
type: string
description: |
A comma-separated list of test configurations from the test matrix to keep,
The empty list means we are going to keep every configurations by defaults
default: ""
job-name:
type: string
required: false
@ -40,6 +47,9 @@ outputs:
ci-no-td:
description: True if ci-no-td label was on PR or [ci-no-td] in PR body.
value: ${{ steps.filter.outputs.ci-no-td }}
ci-td-distributed:
description: True if ci-td-distributed label was on PR or [ci-td-distributed] in PR body.
value: ${{ steps.filter.outputs.ci-td-distributed }}
runs:
using: composite
@ -123,6 +133,7 @@ runs:
--workflow "${GITHUB_WORKFLOW}" \
--job-name "${JOB_NAME}" \
--test-matrix "${{ inputs.test-matrix }}" \
--selected-test-configs "${{ inputs.selected-test-configs }}" \
--pr-number "${PR_NUMBER}" \
--tag "${TAG}" \
--event-name "${EVENT_NAME}" \

207
.github/actions/linux-build/action.yml vendored Normal file
View File

@ -0,0 +1,207 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: Runner label to select worker type
runner:
required: false
default: "linux.2xlarge"
description: |
List of CUDA architectures CI build should target.
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

384
.github/actions/linux-test/action.yml vendored Normal file
View File

@ -0,0 +1,384 @@
name: linux-test
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
HUGGING_FACE_HUB_TOKEN:
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
#env:
# GIT_DEFAULT_BRANCH: ${{ inputs.default_branch }}
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
if : ${{ inputs.aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
- name: Lock NVIDIA A100 40GB Frequency
shell: bash
run: |
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
- name: Start monitoring script
id: monitor-script
shell: bash
continue-on-error: true
run: |
python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84
python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download TD artifacts
continue-on-error: true
uses: ./.github/actions/download-td-artifacts
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
- name: Check for keep-going label and re-enabled test issues
# This uses the filter-test-configs action because it conviniently
# checks for labels and re-enabled test issues. It does not actually do
# any filtering. All filtering is done in the build step.
id: keep-going
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Test
id: test
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
PR_NUMBER: ${{ github.event.pull_request.number }}
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
set -x
if [[ $TEST_CONFIG == 'multigpu' ]]; then
TEST_COMMAND=.ci/pytorch/multigpu-test.sh
elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then
TEST_COMMAND=.ci/onnx/test.sh
else
TEST_COMMAND=.ci/pytorch/test.sh
fi
# detached container should get cleaned up by teardown_ec2_linux
# TODO: Stop building test binaries as part of the build phase
# Used for GPU_FLAG since that doesn't play nice
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
-e GITHUB_REPOSITORY \
-e GITHUB_WORKFLOW \
-e GITHUB_JOB \
-e GITHUB_RUN_ID \
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BASE_SHA \
-e BRANCH \
-e SHA1 \
-e AWS_DEFAULT_REGION \
-e IN_WHEEL_TEST \
-e SHARD_NUMBER \
-e TEST_CONFIG \
-e NUM_TEST_SHARDS \
-e REENABLED_ISSUES \
-e CONTINUE_THROUGH_ERROR \
-e VERBOSE_TEST_LOGS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e TD_DISTRIBUTED \
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e HUGGING_FACE_HUB_TOKEN \
-e DASHBOARD_TAG \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--ipc=host \
--shm-size="${SHM_SIZE}" \
--tty \
--detach \
--name="${container_name}" \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
# Propagate download.pytorch.org IP to container
grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts"
echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"
- name: Upload pytest cache if tests failed
uses: ./.github/actions/pytest-cache-upload
continue-on-error: true
if: failure() && steps.test.conclusion && steps.test.conclusion == 'failure'
with:
cache_dir: .pytest_cache
shard: ${{ matrix.shard }}
sha: ${{ github.event.pull_request.head.sha || github.sha }}
test_config: ${{ matrix.config }}
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
- name: Print remaining test logs
shell: bash
if: always() && steps.test.conclusion
run: |
cat test/**/*_toprint.log || true
- name: Stop monitoring script
if: always() && steps.monitor-script.outputs.monitor-script-pid
shell: bash
continue-on-error: true
env:
MONITOR_SCRIPT_PID: ${{ steps.monitor-script.outputs.monitor-script-pid }}
run: |
kill "$MONITOR_SCRIPT_PID"
- name: Upload test artifacts
uses: ./.github/actions/upload-test-artifacts
if: always() && steps.test.conclusion && steps.test.conclusion != 'skipped'
with:
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
use-gha: ${{ inputs.use-gha }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Collect backtraces from coredumps (if any)
if: always()
shell: bash
run: |
# shellcheck disable=SC2156
find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;
- name: Store Core dumps on S3
uses: seemethere/upload-artifact-s3@v5
if: failure()
with:
name: coredumps-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}
retention-days: 14
if-no-files-found: ignore
path: ./**/core.[1-9]*
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does
# not seem to help. Here are some symptoms:
# * Calling nvidia-smi timeouts after 60 second
# * Fail to run nvidia-smi with an unable to determine the device handle for GPU
# unknown error
# * Test fails with a missing CUDA GPU error when initializing CUDA in PyTorch
# * Run docker --gpus all fails with error response from daemon
#
# As both the root cause and recovery path are unclear, let's take the runner out of
# service so that it doesn't get any more jobs
- name: Check NVIDIA driver installation step
if: failure() && steps.install-nvidia-driver.outcome && steps.install-nvidia-driver.outcome != 'skipped'
shell: bash
env:
RUNNER_WORKSPACE: ${{ runner.workspace }}
run: |
set +e
set -x
nvidia-smi
# NB: Surprisingly, nvidia-smi command returns successfully with return code 0 even in
# the case where the driver has already crashed as it still can get the driver version
# and some basic information like the bus ID. However, the rest of the information
# would be missing (ERR!), for example:
#
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# | | | MIG M. |
# |===============================+======================+======================|
# | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
# |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
# | | | ERR! |
# +-------------------------------+----------------------+----------------------+
#
# +-----------------------------------------------------------------------------+
# | Processes: |
# | GPU GI CI PID Type Process name GPU Memory |
# | ID ID Usage |
# |=============================================================================|
# +-----------------------------------------------------------------------------+
#
# This should be reported as a failure instead as it will guarantee to fail when
# Docker tries to run with --gpus all
#
# So, the correct check here is to query one of the missing piece of info like
# GPU name, so that the command can fail accordingly
nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
NVIDIA_SMI_STATUS=$?
# These are acceptable return code from nvidia-smi as copied from setup-nvidia GitHub action
if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
echo "NVIDIA driver installation has failed, shutting down the runner..."
.github/scripts/stop_runner_service.sh
fi
# For runner with multiple GPUs, we also want to confirm that the number of GPUs are the
# power of 2, i.e. 1, 2, 4, or 8. This is to avoid flaky test issue when one GPU fails
# https://github.com/pytorch/test-infra/issues/4000
GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
NVIDIA_SMI_STATUS=$?
# These are acceptable return code from nvidia-smi as copied from setup-nvidia GitHub action
if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
echo "NVIDIA driver installation has failed, shutting down the runner..."
.github/scripts/stop_runner_service.sh
fi
# Check the GPU count to be a power of 2
if [ "$GPU_COUNT" -le 8 ] && [ "$GPU_COUNT" -ne 1 ] && [ "$GPU_COUNT" -ne 2 ] && [ "$GPU_COUNT" -ne 4 ] && [ "$GPU_COUNT" -ne 8 ]; then
echo "NVIDIA driver detects $GPU_COUNT GPUs. The runner has a broken GPU, shutting it down..."
.github/scripts/stop_runner_service.sh
fi

View File

@ -9,6 +9,10 @@ inputs:
job_identifier:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
s3_bucket:
description: S3 bucket to download PyTest cache
required: false
default: "gha-artifacts"
runs:
using: composite
@ -30,6 +34,7 @@ runs:
CACHE_DIR: ${{ inputs.cache_dir }}
JOB_IDENTIFIER: ${{ inputs.job_identifier }}
REPO: ${{ github.repository }}
BUCKET: ${{ inputs.s3_bucket }}
run: |
python3 .github/scripts/pytest_cache.py \
--download \
@ -38,3 +43,4 @@ runs:
--job_identifier $JOB_IDENTIFIER \
--temp_dir $RUNNER_TEMP \
--repo $REPO \
--bucket $BUCKET \

View File

@ -15,10 +15,12 @@ runs:
category=$1
# If it is GCP runner (runner name contains gcp), do not run this
runner_name_str=${{ runner.name }}
if [[ $runner_name_str != *"gcp"* ]]; then
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
else
if [[ -f /.inarc ]]; then
echo "ARC Runner, no info on ec2 metadata"
elif [[ $runner_name_str == *"gcp"* ]]; then
echo "Runner is from Google Cloud Platform, No info on ec2 metadata"
else
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
fi
}
echo "ami-id: $(get_ec2_metadata ami-id)"
@ -26,8 +28,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
- name: Start docker if docker deamon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
@ -58,6 +66,7 @@ runs:
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Kill any existing containers, clean up images
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
shell: bash
run: |
# ignore expansion of "docker ps -q" since it could be empty
@ -96,3 +105,28 @@ runs:
echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts
cat /etc/hosts
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}
run: |
set +x
max_attempts=30
delay=10
attempt=1
for attempt in $(seq 1 $max_attempts); do
echo "Attempt $attempt of $max_attempts: Checking if Docker daemon is running..."
if docker info > /dev/null 2>&1; then
echo "Docker is running. Proceeding with the next steps"
exit 0
else
echo "Docker is not running yet."
echo "Retrying in $delay seconds..."
sleep $delay
fi
done
echo "Reached maximum attempts to connect to Docker. Exiting."
exit 1

View File

@ -11,6 +11,10 @@ inputs:
Suffix to add to the filename of the artifacts. This should include the
workflow job id, see [Job id in artifacts].
required: true
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -42,7 +46,7 @@ runs:
env:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# Remove any previous test reports if they exist
# Remove any previous usage logs if they exist
rm -f logs-*.zip
# this workflow is also run in bazel build test, but we dont generate usage reports for it
# so check to see if the file exists first
@ -53,6 +57,18 @@ runs:
zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log'
fi
- name: Zip debugging artifacts for upload
if: runner.os != 'Windows' && !inputs.use-gha
shell: bash
env:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# Remove any previous debugging artifacts if they exist
rm -f debug-*.zip
if [ -d 'test/debug' ]; then
zip -r "debug-${FILE_SUFFIX}.zip" test/debug
fi
# Windows zip
- name: Zip JSONs for upload
if: runner.os == 'Windows' && !inputs.use-gha
@ -87,6 +103,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -97,6 +114,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -108,12 +126,25 @@ runs:
if: ${{ !inputs.use-gha }}
continue-on-error: true
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
if-no-files-found: ignore
path: logs-*.zip
- name: Store Debug Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
continue-on-error: true
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
if-no-files-found: ignore
path: debug-*.zip
# GHA upload
- name: Store Test Downloaded JSONs on Github
uses: actions/upload-artifact@v3

View File

@ -1 +1 @@
87aeb554d3e2f7855b7abe5120c282f59648ed7a
ea437b31ce316ea3d66fe73768c0dcb94edb79ad

View File

@ -1 +1 @@
2c127da8b5e2e8f44b50994c6cb931bcca267cfe
d23a6e1664d20707c11781299611436e1f0c104f

View File

@ -1 +1 @@
707a632930bfde19ffb361cdf5c31a7682af4e67
e3fc03314dab5f44e3ed9ccbba6c15fbca3285cd

13
.github/label_to_label.yml vendored Normal file
View File

@ -0,0 +1,13 @@
# Use this to auto apply labels based on other labels. Applies to both PRs and
# issues. Currently only supports any and all
- any:
- "module: custom operators"
- "module: aotdispatch"
then:
- "module: pt2-dispatcher"
- any:
- "module: dynamo"
- "module: pt2-dispatcher"
- "module: inductor"
then:
- "oncall: pt2"

14
.github/labeler.yml vendored
View File

@ -35,6 +35,9 @@
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
- torch/distributed/fsdp/**
- torch/csrc/inductor/**
- test/cpp/aoti_abi_check/**
- test/cpp/aoti_inference/**
"module: cpu":
- aten/src/ATen/cpu/**
@ -55,6 +58,17 @@
- third_party/mkl-dnn.BUILD
- torch/csrc/jit/codegen/onednn/**
- test/test_jit_llga_fuser.py
- test/test_mkldnn.py
"ciflow/linux-aarch64":
- third_party/ideep
- caffe2/ideep/**
- caffe2/python/ideep/**
- cmake/Modules/FindMKLDNN.cmake
- third_party/mkl-dnn.BUILD
- torch/csrc/jit/codegen/onednn/**
- test/test_jit_llga_fuser.py
- test/test_mkldnn.py
"module: amp (automated mixed precision)":
- torch/amp/**

View File

@ -28,12 +28,13 @@
- caffe2/python/onnx/**
approved_by:
- BowenBao
- abock
- justinchuby
- liqunfu
- shubhambhokare1
- thiagocrepaldi
- titaiwangms
- wschin
- xadupre
mandatory_checks_name:
- EasyCLA
- Lint
@ -236,6 +237,23 @@
- Lint
- pull
- name: XPU ATen
patterns:
- aten/src/ATen/xpu/**
- c10/xpu/**
- torch/csrc/xpu/**
- torch/xpu/**
- test/xpu/**
- third_party/xpu.txt
approved_by:
- EikanWang
- jgong5
- gujinghui
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Distributions
patterns:
- torch/distributions/**
@ -357,12 +375,14 @@
- name: CPU inductor
patterns:
- torch/_inductor/mkldnn_lowerings.py
- torch/_inductor/fx_passes/mkldnn_fusion.py
- torch/_inductor/fx_passes/quantization.py
- torch/_inductor/codegen/cpp.py
- test/inductor/test_mkldnn_pattern_matcher.py
- test/inductor/test_cpu_repo.py
- test/inductor/test_cpu_cpp_wrapper.py
- aten/src/ATen/cpu/**
- aten/src/ATen/native/quantized/cpu/**
- test/quantization/core/test_quantized_op.py
- torch/ao/quantization/quantizer/x86_inductor_quantizer.py

View File

@ -1,5 +1,6 @@
tracking_issue: 24422
ciflow_tracking_issue: 64124
TD_rollout_issue: 123120
ciflow_push_tags:
- ciflow/binaries
- ciflow/binaries_conda
@ -7,6 +8,8 @@ ciflow_push_tags:
- ciflow/binaries_wheel
- ciflow/inductor
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/periodic
@ -15,9 +18,12 @@ ciflow_push_tags:
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/torchbench
retryable_workflows:
- lint
- pull
- trunk
- linux-binary
- windows-binary
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml

View File

@ -5,7 +5,7 @@
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.19.12
jinja2==3.0.1
jinja2==3.1.4
lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84

View File

@ -1,4 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.9.1
optree==0.11.0

View File

@ -26,4 +26,7 @@ pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.9.1
optree==0.11.0
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2

99
.github/scripts/amd/package_triton_wheel.sh vendored Executable file
View File

@ -0,0 +1,99 @@
set -ex
# Set ROCM_HOME isn't available, use ROCM_PATH if set or /opt/rocm
ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"
# Find rocm_version.h header file for ROCm version extract
rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"
if [ ! -f "$rocm_version_h" ]; then
rocm_version_h="${ROCM_HOME}/include/rocm_version.h"
fi
# Error out if rocm_version.h not found
if [ ! -f "$rocm_version_h" ]; then
echo "Error: rocm_version.h not found in expected locations." >&2
exit 1
fi
# Extract major, minor and patch ROCm version numbers
MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')
MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')
PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')
ROCM_INT=$(($MAJOR_VERSION * 10000 + $MINOR_VERSION * 100 + $PATCH_VERSION))
echo "ROCm version: $ROCM_INT"
# Check TRITON_ROCM_DIR is set
if [[ -z "${TRITON_ROCM_DIR}" ]]; then
export TRITON_ROCM_DIR=third_party/amd/backend
fi
# Remove packaged libs and headers
rm -rf $TRITON_ROCM_DIR/include/*
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
OS_SO_PATHS=(
$LIBELF_PATH
$LIBNUMA_PATH
$LIBTINFO_PATH
)
for lib in "${OS_SO_PATHS[@]}"
do
cp $lib $TRITON_ROCM_DIR/lib/
done
# Required ROCm libraries
if [[ "${MAJOR_VERSION}" == "6" ]]; then
libamdhip="libamdhip64.so.6"
else
libamdhip="libamdhip64.so.5"
fi
# Required ROCm libraries - ROCm 6.0
ROCM_SO=(
"${libamdhip}"
"libhsa-runtime64.so.1"
"libamd_comgr.so.2"
"libdrm.so.2"
"libdrm_amdgpu.so.1"
)
if [[ $ROCM_INT -ge 60100 ]]; then
ROCM_SO+=("librocprofiler-register.so.0")
fi
for lib in "${ROCM_SO[@]}"
do
file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib
if [[ -z $file_path ]]; then
if [ -d "$ROCM_HOME/lib64/" ]; then
file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64
fi
fi
if [[ -z $file_path ]]; then
file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME
fi
if [[ -z $file_path ]]; then
file_path=($(find /opt/ -name "$lib")) # Then search in /opt
fi
if [[ -z $file_path ]]; then
echo "Error: Library file $lib is not found." >&2
exit 1
fi
cp $file_path $TRITON_ROCM_DIR/lib
# When running locally, and not building a wheel, we need to satisfy shared objects requests that don't look for versions
LINKNAME=$(echo $lib | sed -e 's/\.so.*/.so/g')
ln -sf $lib $TRITON_ROCM_DIR/lib/$LINKNAME
done
# Copy Include Files
cp -r $ROCM_HOME/include/hip $TRITON_ROCM_DIR/include
# Copy linker
mkdir -p $TRITON_ROCM_DIR/llvm/bin
cp $ROCM_HOME/llvm/bin/ld.lld $TRITON_ROCM_DIR/llvm/bin/

103
.github/scripts/amd/patch_triton_wheel.sh vendored Executable file
View File

@ -0,0 +1,103 @@
#!/bin/bash
set -x
if [ -z "$1" ]; then
echo "Need wheel location argument" && exit 1
fi
WHEELHOUSE_DIR=$1
PATCHELF_BIN=patchelf
ROCM_LIB=backends/amd/lib
ROCM_LD=backends/amd/llvm/bin
PREFIX=triton
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
replace_needed_sofiles() {
find $1 -name '*.so*' -o -name 'ld.lld' | while read sofile; do
origname=$2
patchedname=$3
if [[ "$origname" != "$patchedname" ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
}
mkdir -p "/tmp_dir"
pushd /tmp_dir
for pkg in /$WHEELHOUSE_DIR/*triton*.whl; do
echo "Modifying $pkg"
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
$PATCHELF_BIN --set-rpath ${LD_SO_RPATH:-'$ORIGIN:$ORIGIN/../../lib'} $PREFIX/$ROCM_LD/ld.lld
$PATCHELF_BIN --print-rpath $PREFIX/$ROCM_LD/ld.lld
# Modify libtriton.so as it sits in _C directory apart from its dependencies
find $PREFIX/_C -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile"
$PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/'../$ROCM_LIB} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# All included dependencies are included in a single lib directory
deps=()
deps_soname=()
while read sofile; do
echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"
$PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
deps+=("$sofile")
deps_soname+=("$(basename $sofile)")
done < <(find $PREFIX/$ROCM_LIB -type f -name "*.so*")
patched=()
for filepath in "${deps[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/$ROCM_LIB/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
patchedpath=$(fname_without_so_number $destpath)
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
# Go through all required shared objects and see if any of our other objects are dependants. If so, replace so.ver wth so
for ((i=0;i<${#deps[@]};++i)); do
echo "replacing "${deps_soname[i]} ${patched[i]}
replace_needed_sofiles $PREFIX/$ROCM_LIB ${deps_soname[i]} ${patched[i]}
replace_needed_sofiles $PREFIX/_C ${deps_soname[i]} ${patched[i]}
replace_needed_sofiles $PREFIX/$ROCM_LD ${deps_soname[i]} ${patched[i]}
done
# Re-bundle whl with so adjustments
zip -rqy $(basename $pkg) *
if [[ -z "${MANYLINUX_VERSION}" ]]; then
newpkg=$pkg
else
newpkg=$(echo $pkg | sed -e "s/\linux_x86_64/${MANYLINUX_VERSION}/g")
fi
# Remove original whl
rm -f $pkg
# Move rebuilt whl to original location with new name.
mv $(basename $pkg) $newpkg
done

View File

@ -10,9 +10,6 @@ from typing import Optional
SCRIPT_DIR = Path(__file__).parent
REPO_DIR = SCRIPT_DIR.parent.parent
# TODO: Remove me once Triton version is again in sync for vanilla and ROCm
ROCM_TRITION_VERSION = "2.1.0"
def read_triton_pin(rocm_hash: bool = False) -> str:
triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"
@ -32,27 +29,6 @@ def check_and_replace(inp: str, src: str, dst: str) -> str:
return inp.replace(src, dst)
def patch_setup_py(
path: Path,
*,
version: str,
name: str = "triton",
expected_version: Optional[str] = None,
) -> None:
with open(path) as f:
orig = f.read()
# Replace name
orig = check_and_replace(orig, 'name="triton",', f'name="{name}",')
# Replace version
if not expected_version:
expected_version = read_triton_version()
orig = check_and_replace(
orig, f'version="{expected_version}",', f'version="{version}",'
)
with open(path, "w") as f:
f.write(orig)
def patch_init_py(
path: Path, *, version: str, expected_version: Optional[str] = None
) -> None:
@ -92,14 +68,20 @@ def build_triton(
with TemporaryDirectory() as tmpdir:
triton_basedir = Path(tmpdir) / "triton"
triton_pythondir = triton_basedir / "python"
triton_repo = "https://github.com/openai/triton"
if build_rocm:
triton_repo = "https://github.com/ROCmSoftwarePlatform/triton"
triton_pkg_name = "pytorch-triton-rocm"
else:
triton_repo = "https://github.com/openai/triton"
triton_pkg_name = "pytorch-triton"
check_call(["git", "clone", triton_repo], cwd=tmpdir)
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if release:
ver, rev, patch = version.split(".")
check_call(
["git", "checkout", f"release/{ver}.{rev}.x"], cwd=triton_basedir
)
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(
@ -155,18 +137,15 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
expected_version=ROCM_TRITION_VERSION if build_rocm else None,
expected_version=None,
)
if build_rocm:
# TODO: Remove me when ROCM triton is updated
patch_setup_py(
triton_pythondir / "setup.py",
name=triton_pkg_name,
version=f"{version}",
expected_version=ROCM_TRITION_VERSION,
check_call(
[f"{SCRIPT_DIR}/amd/package_triton_wheel.sh"],
cwd=triton_basedir,
shell=True,
)
check_call("scripts/amd/setup_rocm_libs.sh", cwd=triton_basedir, shell=True)
print("ROCm libraries setup for triton installation...")
check_call(
@ -177,7 +156,10 @@ def build_triton(
shutil.copy(whl_path, Path.cwd())
if build_rocm:
check_call("scripts/amd/fix_so.sh", cwd=triton_basedir, shell=True)
check_call(
[f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],
cwd=triton_basedir,
)
return Path.cwd() / whl_path.name

View File

@ -29,7 +29,7 @@ def parse_args() -> Any:
"--onto-branch", type=str, required=True, help="the target release branch"
)
parser.add_argument(
"--github-actor", type=str, required=True, help="all the worlds a stage"
"--github-actor", type=str, required=True, help="all the world's a stage"
)
parser.add_argument(
"--classification",

View File

@ -23,8 +23,10 @@ def main() -> None:
job_link = f"[job]({run_url})" if run_url is not None else "job"
msg = (
f"The {args.action} {job_link} was canceled. If you believe this is a mistake,"
+ f" then you can re trigger it through [pytorch-bot]({BOT_COMMANDS_WIKI})."
f"The {args.action} {job_link} was canceled or timed out. This most often happen if two merge requests were issued"
+ " for the same PR, or if merge job was waiting for more than 6 hours for tests to finish."
+ " In later case, please do not hesitate to reissue the merge command\n"
+ f" For more information see [pytorch-bot wiki]({BOT_COMMANDS_WIKI})."
)
gh_post_pr_comment(org, project, args.pr_num, msg)

View File

@ -18,7 +18,7 @@ ESTIMATED_TOKENS = [0]
TOKEN = os.environ["GITHUB_TOKEN"]
if not TOKEN:
raise Exception("GITHUB_TOKEN is not set")
raise Exception("GITHUB_TOKEN is not set") # noqa: TRY002
REPO_ROOT = Path(__file__).parent.parent.parent

Binary file not shown.

View File

@ -1,6 +1,7 @@
#!/usr/bin/env python3
import json
import logging
import os
import re
import subprocess
@ -8,6 +9,7 @@ import sys
import warnings
from enum import Enum
from functools import lru_cache
from logging import info
from typing import Any, Callable, Dict, List, Optional, Set
from urllib.request import Request, urlopen
@ -17,33 +19,7 @@ REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://gi
PREFIX = "test-config/"
# Same as shard names
VALID_TEST_CONFIG_LABELS = {
f"{PREFIX}{label}"
for label in {
"backwards_compat",
"crossref",
"default",
"deploy",
"distributed",
"docs_tests",
"dynamo",
"force_on_cpu",
"functorch",
"inductor",
"inductor_distributed",
"inductor_huggingface",
"inductor_timm",
"inductor_torchbench",
"jit_legacy",
"multigpu",
"nogpu_AVX512",
"nogpu_NO_AVX2",
"slow",
"tsan",
"xla",
}
}
logging.basicConfig(level=logging.INFO)
def is_cuda_or_rocm_job(job_name: Optional[str]) -> bool:
@ -90,6 +66,12 @@ def parse_args() -> Any:
parser.add_argument(
"--test-matrix", type=str, required=True, help="the original test matrix"
)
parser.add_argument(
"--selected-test-configs",
type=str,
default="",
help="a comma-separated list of test configurations from the test matrix to keep",
)
parser.add_argument(
"--workflow", type=str, help="the name of the current workflow, i.e. pull"
)
@ -155,19 +137,25 @@ def get_labels(pr_number: int) -> Set[str]:
}
def filter_labels(labels: Set[str], label_regex: Any) -> Set[str]:
"""
Return the list of matching labels
"""
return {l for l in labels if re.match(label_regex, l)}
def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, List[Any]]:
"""
Select the list of test config to run from the test matrix. The logic works
as follows:
If the PR has one or more labels as specified in the VALID_TEST_CONFIG_LABELS set, only
these test configs will be selected. This also works with ciflow labels, for example,
if a PR has both ciflow/trunk and test-config/functorch, only trunk functorch builds
and tests will be run
If the PR has one or more test-config labels as specified, only these test configs
will be selected. This also works with ciflow labels, for example, if a PR has both
ciflow/trunk and test-config/functorch, only trunk functorch builds and tests will
be run.
If the PR has none of the test-config label, all tests are run as usual.
"""
filtered_test_matrix: Dict[str, List[Any]] = {"include": []}
for entry in test_matrix.get("include", []):
@ -177,23 +165,46 @@ def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, Lis
label = f"{PREFIX}{config_name.strip()}"
if label in labels:
print(
f"Select {config_name} because label {label} is presented in the pull request by the time the test starts"
)
msg = f"Select {config_name} because label {label} is present in the pull request by the time the test starts"
info(msg)
filtered_test_matrix["include"].append(entry)
valid_test_config_labels = labels.intersection(VALID_TEST_CONFIG_LABELS)
if not filtered_test_matrix["include"] and not valid_test_config_labels:
# Found no valid label and the filtered test matrix is empty, return the same
test_config_labels = filter_labels(labels, re.compile(f"{PREFIX}.+"))
if not filtered_test_matrix["include"] and not test_config_labels:
info("Found no test-config label on the PR, so all test configs are included")
# Found no test-config label and the filtered test matrix is empty, return the same
# test matrix as before so that all tests can be run normally
return test_matrix
else:
msg = f"Found {test_config_labels} on the PR so only these test configs are run"
info(msg)
# When the filter test matrix contain matches or if a valid test config label
# is found in the PR, return the filtered test matrix
return filtered_test_matrix
def filter_selected_test_configs(
test_matrix: Dict[str, List[Any]], selected_test_configs: Set[str]
) -> Dict[str, List[Any]]:
"""
Keep only the selected configs if the list if not empty. Otherwise, keep all test configs.
This filter is used when the workflow is dispatched manually.
"""
if not selected_test_configs:
return test_matrix
filtered_test_matrix: Dict[str, List[Any]] = {"include": []}
for entry in test_matrix.get("include", []):
config_name = entry.get("config", "")
if not config_name:
continue
if config_name in selected_test_configs:
filtered_test_matrix["include"].append(entry)
return filtered_test_matrix
def set_periodic_modes(
test_matrix: Dict[str, List[Any]], job_name: Optional[str]
) -> Dict[str, List[Any]]:
@ -374,30 +385,33 @@ def process_jobs(
# - If the target record has the job (config) name, only that test config
# will be skipped or marked as unstable
if not target_job_cfg:
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"all CI jobs for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
)
if target_job_cfg == BUILD_JOB_NAME:
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"the build job for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
)
if target_job_cfg in (TEST_JOB_NAME, BUILD_AND_TEST_JOB_NAME):
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"all the test jobs for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
@ -463,7 +477,7 @@ def parse_reenabled_issues(s: Optional[str]) -> List[str]:
def get_reenabled_issues(pr_body: str = "") -> List[str]:
default_branch = os.getenv("GIT_DEFAULT_BRANCH", "main")
default_branch = f"origin/{os.environ.get('GIT_DEFAULT_BRANCH', 'main')}"
try:
commit_messages = subprocess.check_output(
f"git cherry -v {default_branch}".split(" ")
@ -494,10 +508,15 @@ def perform_misc_tasks(
"ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")
)
set_output("ci-no-td", check_for_setting(labels, pr_body, "ci-no-td"))
# Only relevant for the one linux distributed cuda job, delete this when TD
# is rolled out completely
set_output(
"ci-td-distributed", check_for_setting(labels, pr_body, "ci-td-distributed")
)
# Obviously, if the job name includes unstable, then this is an unstable job
is_unstable = job_name and IssueType.UNSTABLE.value in job_name
if not is_unstable and test_matrix:
if not is_unstable and test_matrix and test_matrix.get("include"):
# Even when the job name doesn't mention unstable, we will also mark it as
# unstable when the test matrix only includes unstable jobs. Basically, this
# logic allows build or build-and-test jobs to be marked as unstable too.
@ -567,6 +586,16 @@ def main() -> None:
# No PR number, no tag, we can just return the test matrix as it is
filtered_test_matrix = test_matrix
if args.selected_test_configs:
selected_test_configs = {
v.strip().lower()
for v in args.selected_test_configs.split(",")
if v.strip()
}
filtered_test_matrix = filter_selected_test_configs(
filtered_test_matrix, selected_test_configs
)
if args.event_name == "schedule" and args.schedule == "29 8 * * *":
# we don't want to run the mem leak check or disabled tests on normal
# periodically scheduled jobs, only the ones at this time

View File

@ -13,16 +13,16 @@ architectures:
import os
from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1"]
CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1"}
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8", "12.4": "8"}
ROCM_ARCHES = ["5.7", "6.0"]
ROCM_ARCHES = ["6.0", "6.1"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -58,6 +58,20 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}
@ -324,7 +338,7 @@ def generate_wheels_matrix(
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if arch_version in ["12.1", "11.8"] and os == "linux":
if arch_version in ["12.4", "12.1", "11.8"] and os == "linux":
ret.append(
{
"python_version": python_version,
@ -367,5 +381,6 @@ def generate_wheels_matrix(
return ret
validate_nccl_dep_consistency("12.4")
validate_nccl_dep_consistency("12.1")
validate_nccl_dep_consistency("11.8")

View File

@ -378,7 +378,9 @@ def main() -> None:
for template, workflows in template_and_workflows:
# added Iterable check to appease the mypy gods
if not isinstance(workflows, Iterable):
raise Exception(f"How is workflows not iterable? {workflows}")
raise Exception( # noqa: TRY002
f"How is workflows not iterable? {workflows}"
) # noqa: TRY002
for workflow in workflows:
workflow.generate_workflow_file(workflow_template=template)

View File

@ -21,6 +21,8 @@ DOCKER_IMAGE_TYPES = ["runtime", "devel"]
def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
ret: List[Dict[str, str]] = []
# CUDA amd64 Docker images are available as both runtime and devel while
# CPU arm64 image is only available as runtime.
for cuda, version in generate_binary_build_matrix.CUDA_ARCHES_FULL_VERSION.items():
for image in DOCKER_IMAGE_TYPES:
ret.append(
@ -31,9 +33,19 @@ def generate_docker_matrix() -> Dict[str, List[Dict[str, str]]]:
cuda
],
"image_type": image,
"platform": "linux/arm64,linux/amd64",
"platform": "linux/amd64",
}
)
ret.append(
{
"cuda": "cpu",
"cuda_full_version": "",
"cudnn_version": "",
"image_type": "runtime",
"platform": "linux/arm64",
}
)
return {"include": ret}

View File

@ -4,6 +4,7 @@
import argparse
import json
import operator
import os
import re
import sys
@ -126,7 +127,7 @@ def find_job_id_name(args: Any) -> Tuple[str, str]:
# Sort the jobs list by start time, in descending order. We want to get the most
# recently scheduled job on the runner.
jobs.sort(key=lambda job: job["started_at"], reverse=True)
jobs.sort(key=operator.itemgetter("started_at"), reverse=True)
for job in jobs:
if job["runner_name"] == args.runner_name:

99
.github/scripts/get_workflow_type.py vendored Normal file
View File

@ -0,0 +1,99 @@
import json
from argparse import ArgumentParser
from typing import Any
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_TYPE_LABEL = "label"
WORKFLOW_TYPE_RG = "rg"
WORKFLOW_TYPE_BOTH = "both"
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
parser.add_argument(
"--github-repo",
type=str,
required=False,
default="pytorch/test-infra",
help="GitHub repo to get the issue",
)
parser.add_argument(
"--github-issue", type=int, required=True, help="GitHub issue umber"
)
parser.add_argument(
"--github-user", type=str, required=True, help="GitHub username"
)
parser.add_argument(
"--github-branch", type=str, required=True, help="Current GitHub branch"
)
return parser.parse_args()
def get_gh_client(github_token: str) -> Github:
auth = Auth.Token(github_token)
return Github(auth=auth)
def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
repo = gh.get_repo(repo)
return repo.get_issue(number=issue_num)
def is_exception_branch(branch: str) -> bool:
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, username: str) -> str:
user_list = issue.get_comments()[0].body.split("\r\n")
try:
run_option = issue.get_comments()[1].body.split("\r\n")[0]
except Exception as e:
run_option = "single"
if user_list[0] == "!":
# Use old runners for everyone
return WORKFLOW_TYPE_LABEL
elif user_list[1] == "*":
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for everyone
return WORKFLOW_TYPE_BOTH
else:
# Use only ARC runners for everyone
return WORKFLOW_TYPE_RG
elif username in user_list:
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for a specific user
return WORKFLOW_TYPE_BOTH
else:
# Use only ARC runners for a specific user
return WORKFLOW_TYPE_RG
else:
# Use old runners by default
return WORKFLOW_TYPE_LABEL
def main() -> None:
args = parse_args()
if is_exception_branch(args.github_branch):
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
else:
try:
gh = get_gh_client(args.github_token)
issue = get_issue(gh, args.github_repo, args.github_issue)
output = {"workflow_type": get_workflow_type(issue, args.github_user)}
except Exception as e:
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
json_output = json.dumps(output)
print(json_output)
if __name__ == "__main__":
main()

Binary file not shown.

View File

@ -6,6 +6,9 @@ CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
conda activate "${CONDA_ENV}"
# Use uv to speed up lintrunner init
python3 -m pip install uv
CACHE_DIRECTORY="/tmp/.lintbin"
# Try to recover the cached binaries
if [[ -d "${CACHE_DIRECTORY}" ]]; then

20
.github/scripts/td_llm_indexer.sh vendored Normal file
View File

@ -0,0 +1,20 @@
#!/bin/bash
set -euxo pipefail
# Download requirements
cd llm-target-determinator
pip install -q -r requirements.txt
cd ../codellama
pip install -e .
# Run indexer
cd ../llm-target-determinator
torchrun \
--standalone \
--nnodes=1 \
--nproc-per-node=1 \
indexer.py \
--experiment-name indexer-files \
--granularity FILE

View File

@ -9,6 +9,7 @@ from unittest import main, mock, TestCase
import yaml
from filter_test_configs import (
filter,
filter_selected_test_configs,
get_labels,
mark_unstable_jobs,
parse_reenabled_issues,
@ -17,7 +18,6 @@ from filter_test_configs import (
remove_disabled_jobs,
set_periodic_modes,
SUPPORTED_PERIODICAL_MODES,
VALID_TEST_CONFIG_LABELS,
)
@ -273,13 +273,13 @@ class TestConfigFilter(TestCase):
testcases = [
{
"test_matrix": '{include: [{config: "default", runner: "linux"}]}',
"expected": '{"include": [{"config": "default", "runner": "linux"}]}',
"description": "No match, keep the same test matrix",
"expected": '{"include": []}',
"description": "Request test-config/cfg but the test matrix doesn't have it",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "plain-cfg"}]}',
"expected": '{"include": [{"config": "default", "runner": "linux"}, {"config": "plain-cfg"}]}',
"description": "No match because there is no prefix or suffix, keep the same test matrix",
"expected": '{"include": []}',
"description": "A valid test config label needs to start with test-config/",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", shard: 1}]}',
@ -294,9 +294,8 @@ class TestConfigFilter(TestCase):
)
self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))
def test_filter_with_valid_label(self) -> None:
def test_filter_with_test_config_label(self) -> None:
mocked_labels = {f"{PREFIX}cfg", "ciflow/trunk"}
VALID_TEST_CONFIG_LABELS.add(f"{PREFIX}cfg")
testcases = [
{
@ -317,6 +316,51 @@ class TestConfigFilter(TestCase):
)
self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))
def test_filter_selected_test_configs(self) -> None:
testcases = [
{
"test_matrix": '{include: [{config: "default"}]}',
"selected_test_configs": "",
"expected": '{"include": [{"config": "default"}]}',
"description": "No selected test configs",
},
{
"test_matrix": '{include: [{config: "default"}]}',
"selected_test_configs": "foo",
"expected": '{"include": []}',
"description": "A different test config is selected",
},
{
"test_matrix": '{include: [{config: "default"}]}',
"selected_test_configs": "foo, bar",
"expected": '{"include": []}',
"description": "A different set of test configs is selected",
},
{
"test_matrix": '{include: [{config: "default"}]}',
"selected_test_configs": "foo, bar,default",
"expected": '{"include": [{"config": "default"}]}',
"description": "One of the test config is selected",
},
{
"test_matrix": '{include: [{config: "default"}, {config: "bar"}]}',
"selected_test_configs": "foo, bar,Default",
"expected": '{"include": [{"config": "default"}, {"config": "bar"}]}',
"description": "Several test configs are selected",
},
]
for case in testcases:
selected_test_configs = {
v.strip().lower()
for v in case["selected_test_configs"].split(",")
if v.strip()
}
filtered_test_matrix = filter_selected_test_configs(
yaml.safe_load(case["test_matrix"]), selected_test_configs
)
self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))
def test_set_periodic_modes(self) -> None:
testcases: List[Dict[str, str]] = [
{
@ -641,6 +685,7 @@ class TestConfigFilter(TestCase):
ci_verbose_test_logs: bool = False,
ci_no_test_timeout: bool = False,
ci_no_td: bool = False,
ci_td_distributed: bool = False,
is_unstable: bool = False,
reenabled_issues: str = "",
) -> str:
@ -649,6 +694,7 @@ class TestConfigFilter(TestCase):
f"ci-verbose-test-logs={ci_verbose_test_logs}\n"
f"ci-no-test-timeout={ci_no_test_timeout}\n"
f"ci-no-td={ci_no_td}\n"
f"ci-td-distributed={ci_td_distributed}\n"
f"is-unstable={is_unstable}\n"
f"reenabled-issues={reenabled_issues}\n"
)

View File

@ -205,7 +205,6 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule
approved_by=["pytorch/metamates", "ngimel"],
mandatory_checks_name=[
"Lint",
"Facebook CLA Check",
"pull / linux-xenial-cuda11.3-py3.7-gcc7 / build",
],
ignore_flaky_failures=True,
@ -398,7 +397,7 @@ class TestTryMerge(TestCase):
def test_gql_retrieve_checksuites(self, *args: Any) -> None:
"Fetch comments and conclusions for PR with 60 commits"
pr = GitHubPR("pytorch", "pytorch", 94787)
self.assertEqual(len(pr.get_checkrun_conclusions()), 183)
self.assertEqual(len(pr.get_checkrun_conclusions()), 182)
def test_team_members(self, *args: Any) -> None:
"Test fetching team members works"
@ -742,6 +741,30 @@ class TestBypassFailures(TestCase):
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["UNSTABLE"]) == 1)
# Add another test case where there is no unstable keyword in the job name, but
# the job has already been marked as unstable
pr = GitHubPR("pytorch", "executorch", 3318)
checks = pr.get_checkrun_conclusions()
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
print(checks)
workflow_name = "test-llama-app"
job_name = "mobile-job (android)"
self.assertTrue(
checks[f"Android / {workflow_name} / {job_name}"].classification
== "UNSTABLE"
)
pending, failed, ignorable = categorize_checks(
checks, list(checks.keys()), ok_failed_checks_threshold=1
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["UNSTABLE"]) == 1)
def test_get_classifications_broken_trunk(self, *args: Any) -> None:
# The mock merge base is the actual value returned by gh_fetch_merge_base
test_cases = [
@ -833,6 +856,41 @@ class TestBypassFailures(TestCase):
self.assertTrue(len(ignorable["FLAKY"]) == 4)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 2)
def test_get_classifications_wrong_workflow_name(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 123104)
checks = pr.get_checkrun_conclusions()
check_name = "linux-binary-conda / conda-py3_8-cuda11_8-build / build"
check_name_workflow_path = ".github/workflows/generated-linux-binary-conda-nightly.yml / conda-py3_8-cuda11_8-build / build"
# Mock a check where the workflow name uses the full path
checks[check_name_workflow_path] = JobCheckState(
check_name_workflow_path,
checks[check_name].url,
checks[check_name].status,
checks[check_name].classification,
checks[check_name].job_id,
checks[check_name].title,
checks[check_name].summary,
)
del checks[check_name]
checks = get_classifications(
pr.pr_num,
pr.project,
checks,
[],
)
pending, failed, ignorable = categorize_checks(
checks,
list(checks.keys()),
)
self.assertTrue(len(pending) == 0)
self.assertTrue(len(failed) == 0)
self.assertTrue(len(ignorable["FLAKY"]) == 1)
self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)
@mock.patch("trymerge.read_merge_rules", side_effect=xla_merge_rules)
def test_dont_ignore_flaky_failures(self, *args: Any) -> None:
"""

View File

@ -123,6 +123,7 @@ fragment PRCheckSuites on CheckSuiteConnection {
workflow {
name
}
databaseId
url
}
checkRuns(first: 50) {
@ -1398,7 +1399,10 @@ def find_matching_merge_rule(
)
required_checks = list(
filter(
lambda x: "EasyCLA" in x or not skip_mandatory_checks, mandatory_checks
lambda x: ("EasyCLA" in x)
or ("Facebook CLA Check" in x)
or not skip_mandatory_checks,
mandatory_checks,
)
)
pending_checks, failed_checks, _ = categorize_checks(
@ -1409,6 +1413,13 @@ def find_matching_merge_rule(
else 0,
)
# categorize_checks assumes all tests are required if required_checks is empty.
# this is a workaround as we want to keep that behavior for categorize_checks
# generally.
if not required_checks:
pending_checks = []
failed_checks = []
hud_link = f"https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}"
if len(failed_checks) > 0:
if reject_reason_score < 30000:
@ -1608,28 +1619,59 @@ def remove_job_name_suffix(name: str, replacement: str = ")") -> str:
def is_broken_trunk(
name: str,
check: JobCheckState,
drci_classifications: Any,
) -> bool:
if not name or not drci_classifications:
if not check or not drci_classifications:
return False
name = check.name
job_id = check.job_id
# Consult the list of broken trunk failures from Dr.CI
return any(
name == broken_trunk["name"]
(name == broken_trunk["name"]) or (job_id and job_id == broken_trunk["id"])
for broken_trunk in drci_classifications.get("BROKEN_TRUNK", [])
)
def is_flaky(
name: str,
def is_unstable(
check: JobCheckState,
drci_classifications: Any,
) -> bool:
if not name or not drci_classifications:
if not check or not drci_classifications:
return False
name = check.name
job_id = check.job_id
# The job name has the unstable keyword. This is the original way to mark a job
# as unstable on HUD, Dr.CI, and trymerge
if "unstable" in name:
return True
# Consult the list of unstable failures from Dr.CI
return any(
(name == unstable["name"] or (job_id and job_id == unstable["id"]))
for unstable in drci_classifications.get("UNSTABLE", [])
)
def is_flaky(
check: JobCheckState,
drci_classifications: Any,
) -> bool:
if not check or not drci_classifications:
return False
name = check.name
job_id = check.job_id
# Consult the list of flaky failures from Dr.CI
return any(name == flaky["name"] for flaky in drci_classifications.get("FLAKY", []))
return any(
(name == flaky["name"] or (job_id and job_id == flaky["id"]))
for flaky in drci_classifications.get("FLAKY", [])
)
def is_invalid_cancel(
@ -1702,7 +1744,7 @@ def get_classifications(
if check.status == "SUCCESS" or check.status == "NEUTRAL":
continue
if "unstable" in name:
if is_unstable(check, drci_classifications):
checks_with_classifications[name] = JobCheckState(
check.name,
check.url,
@ -1716,7 +1758,7 @@ def get_classifications(
# NB: It's important to note that when it comes to ghstack and broken trunk classification,
# Dr.CI uses the base of the whole stack
if is_broken_trunk(name, drci_classifications):
if is_broken_trunk(check, drci_classifications):
checks_with_classifications[name] = JobCheckState(
check.name,
check.url,
@ -1728,7 +1770,7 @@ def get_classifications(
)
continue
elif is_flaky(name, drci_classifications):
elif is_flaky(check, drci_classifications):
checks_with_classifications[name] = JobCheckState(
check.name,
check.url,

View File

@ -60,7 +60,7 @@ def rebase_onto(
repo._run_git("rebase", onto_branch, branch)
if repo.rev_parse(branch) == repo.rev_parse(onto_branch):
raise Exception(SAME_SHA_ERROR)
raise Exception(SAME_SHA_ERROR) # noqa: TRY002
if dry_run:
push_result = repo._run_git("push", "--dry-run", "-f", remote_url, refspec)
@ -100,7 +100,7 @@ def rebase_ghstack_onto(
repo._run_git("rebase", onto_branch, orig_ref)
if repo.rev_parse(orig_ref) == repo.rev_parse(onto_branch):
raise Exception(SAME_SHA_ERROR)
raise Exception(SAME_SHA_ERROR) # noqa: TRY002
# steal the identity of the committer of the commit on the orig branch
email = repo._run_git("log", orig_ref, "--pretty=format:%ae", "-1")
@ -126,7 +126,7 @@ def rebase_ghstack_onto(
print(push_result)
if ghstack_result.returncode != 0:
print(ghstack_result.stderr.decode("utf-8"))
raise Exception(f"\n```{push_result}```")
raise Exception(f"\n```{push_result}```") # noqa: TRY002
# The contents of a successful push result should look like:
# Summary of changes (ghstack 0.6.0)

View File

@ -46,7 +46,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
!{{ common.concurrency(build_environment) }}
jobs:

View File

@ -48,7 +48,7 @@ env:
BUILD_ENVIRONMENT: !{{ build_environment }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
{%- if cross_compile_arm64 %}
CROSS_COMPILE_ARM64: 1
{% endif %}

View File

@ -86,9 +86,14 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ inputs.cuda-version != 'cpu' }}
if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
- name: Output disk space left
run: |

View File

@ -78,7 +78,7 @@ on:
jobs:
build:
runs-on: ${{ inputs.runs_on }}
timeout-minutes: 180
timeout-minutes: 210
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }}

View File

@ -28,7 +28,21 @@ on:
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
upload-aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
GH_PYTORCHBOT_TOKEN:
required: false
@ -82,6 +96,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
if : ${{ inputs.aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -97,6 +119,7 @@ jobs:
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Generate netrc (only for docs-push)
if: inputs.push
@ -156,6 +179,14 @@ jobs:
uses: ./.github/actions/chown-workspace
if: always()
- name: configure aws credentials
if : ${{ inputs.upload-aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.upload-aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Upload Python Docs Preview
uses: seemethere/upload-artifact-s3@v5
if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }}
@ -163,7 +194,7 @@ jobs:
retention-days: 14
s3-bucket: doc-previews
if-no-files-found: error
path: pytorch.github.io/docs/main/
path: pytorch_docs/main/
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}
- name: Upload C++ Docs Preview

109
.github/workflows/_linux-build-label.yml vendored Normal file
View File

@ -0,0 +1,109 @@
name: linux-build
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: Runner label to select worker type
runner:
required: false
type: string
default: "linux.2xlarge"
description: |
List of CUDA architectures CI build should target.
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ${{ inputs.runner }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

105
.github/workflows/_linux-build-rg.yml vendored Normal file
View File

@ -0,0 +1,105 @@
name: linux-build-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner-group:
required: false
type: string
default: "arc-lf-linux.2xlarge"
description: Runner group to select group type
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on:
group: ${{ inputs.runner-group }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -47,6 +47,23 @@ on:
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
selected-test-configs:
description: |
A comma-separated list of test configurations from the test matrix to keep,
The empty list means we are going to keep every configurations by defaults
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: Role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -87,6 +104,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -125,6 +150,7 @@ jobs:
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
selected-test-configs: ${{ inputs.selected-test-configs }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
@ -133,6 +159,7 @@ jobs:
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
@ -197,6 +224,7 @@ jobs:
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
@ -207,6 +235,7 @@ jobs:
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main

85
.github/workflows/_linux-test-label.yml vendored Normal file
View File

@ -0,0 +1,85 @@
name: linux-test-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

86
.github/workflows/_linux-test-rg.yml vendored Normal file
View File

@ -0,0 +1,86 @@
name: linux-test-label
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
test-matrix:
required: true
type: string
description: JSON description of what test configs to run.
docker-image:
required: true
type: string
description: Docker image to run in.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
timeout-minutes:
required: false
type: number
default: 240
description: |
Set the maximum (in minutes) how long the workflow should take to finish
use-gha:
required: false
type: string
default: ""
description: If set to any value, upload to GHA. Otherwise upload to S3.
dashboard-tag:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
runs-on:
group: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Test
id: linux-test
uses: ./.github/actions/linux-test
with:
build-environment: ${{ inputs.build-environment }}
test-matrix: ${{ inputs.test-matrix }}
docker-image: ${{ inputs.docker-image }}
sync-tag: ${{ inputs.sync-tag }}
use-gha: ${{ inputs.use-gha }}
dashboard-tag: ${{ inputs.dashboard-tag }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@ -37,6 +37,16 @@ on:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -71,6 +81,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
if : ${{ inputs.aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -91,10 +109,15 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
@ -116,6 +139,7 @@ jobs:
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download TD artifacts
continue-on-error: true
@ -176,6 +200,7 @@ jobs:
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
@ -228,6 +253,7 @@ jobs:
-e VERBOSE_TEST_LOGS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e TD_DISTRIBUTED \
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
@ -240,7 +266,6 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e DASHBOARD_TAG \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--ulimit stack=10485760:83886080 \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--ipc=host \
@ -290,6 +315,7 @@ jobs:
with:
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
use-gha: ${{ inputs.use-gha }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Collect backtraces from coredumps (if any)
if: always()

View File

@ -24,11 +24,6 @@ on:
default: "3.8"
description: |
The python version to be used. Will be 3.8 by default
arch:
required: true
type: string
description: |
Contains the architecture to run the tests with
timeout-minutes:
required: false
type: number
@ -44,7 +39,7 @@ jobs:
# Also ensure that we always run with the right architecture
defaults:
run:
shell: arch -arch ${{ inputs.arch }} bash -e -l {0}
shell: bash -e -l {0}
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
@ -133,12 +128,6 @@ jobs:
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Pre-process arm64 wheels
if: inputs.build-environment == 'macos-12-py3-arm64'
run: |
# As wheels are cross-compiled they are reported as x86_64 ones
ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv "${ORIG_WHLNAME}" "${ARM_WHLNAME}"
- name: Set Test step time
id: test-timeout
shell: bash

View File

@ -0,0 +1,58 @@
name: Check whether the workflow owner can use ARC runners
on:
workflow_call:
inputs:
user_name:
required: true
type: string
description: The name of the workflow owner.
curr_branch:
required: true
type: string
description: Current branch.
issue_number:
required: false
type: string
default: "5132"
outputs:
workflow-type:
description: Type of runners to use
value: ${{ jobs.runner-determinator.outputs.workflow-type }}
jobs:
runner-determinator:
runs-on: linux.4xlarge
outputs:
workflow-type: ${{ steps.set-condition.outputs.workflow-type }}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ISSUE_NUMBER: ${{ inputs.issue_number }}
USERNAME: ${{ inputs.user_name }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
fetch-depth: 1
submodules: true
- name: Install dependencies
run: python3 -m pip install urllib3==1.26.18 PyGithub==2.3.0
- name: Get the workflow type for the current user
id: set-condition
run: |
curr_branch="${{ inputs.curr_branch }}"
echo "Current branch is '$curr_branch'"
output="$(python3 .github/scripts/get_workflow_type.py \
--github-token "$GITHUB_TOKEN" \
--github-issue "$ISSUE_NUMBER" \
--github-branch "$curr_branch" \
--github-user "$USERNAME")"
echo "Output: '${output}'"
WORKFLOW_TYPE=$(echo "${output}" | jq -r '.workflow_type')
echo "workflow-type=$WORKFLOW_TYPE" >> "$GITHUB_OUTPUT"

View File

@ -92,7 +92,7 @@ jobs:
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install rockset==1.0.3
python3 -m pip install rockset==1.0.3 'xdoctest>=1.1.0'
- name: Start monitoring script
id: monitor-script

View File

@ -37,7 +37,7 @@ jobs:
device: ["cuda", "rocm"]
include:
- device: "rocm"
rocm_version: "6.0"
rocm_version: "6.1"
- device: "cuda"
rocm_version: ""
timeout-minutes: 40
@ -119,8 +119,7 @@ jobs:
- uses: actions/upload-artifact@v3
with:
# NB: Use the same name here and all wheels can be downloaded by referring to the same artifact
name: pytorch-triton-wheel
name: pytorch-triton-wheel-${{ matrix.py_vers }}-${{ matrix.device }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/*
@ -157,8 +156,15 @@ jobs:
- name: Download Build Artifacts
uses: actions/download-artifact@v3
with:
name: pytorch-triton-wheel
path: ${{ runner.temp }}/artifacts/
# Download all available artifacts
path: ${{ runner.temp }}/artifacts-all
- name: Select Wheel Artifacts
shell: bash
run: |
set -x
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/pytorch-triton-wheel-*/* "${RUNNER_TEMP}/artifacts/"
- name: Set DRY_RUN (only for tagged pushes)
if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) }}
@ -246,8 +252,7 @@ jobs:
- uses: actions/upload-artifact@v3
with:
# NB: Use the same name here and all wheels can be downloaded by referring to the same artifact
name: pytorch-triton-conda
name: pytorch-triton-conda-${{ matrix.py_vers }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/*
@ -267,8 +272,15 @@ jobs:
- name: Download Build Artifacts
uses: actions/download-artifact@v3
with:
name: pytorch-triton-conda
path: ${{ runner.temp }}/artifacts/
# Download all available artifacts
path: ${{ runner.temp }}/artifacts-all
- name: Select Conda Artifacts
shell: bash
run: |
set -x
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/pytorch-triton-conda-*/* "${RUNNER_TEMP}/artifacts/"
- name: Set DRY_RUN (only for tagged pushes)
if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) }}

View File

@ -7,6 +7,7 @@ on:
- Dockerfile
- docker.Makefile
- .github/workflows/docker-release.yml
- .github/scripts/generate_docker_release_matrix.py
push:
branches:
- nightly
@ -126,20 +127,25 @@ jobs:
run: |
make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image"
- name: Push nightly tags
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' }}
if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.build_platforms == 'linux/amd4' }}
run: |
PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"
CUDA_SUFFIX="-cu${CUDA_VERSION}"
PYTORCH_NIGHTLY_COMMIT=$(docker run ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
python -c 'import torch; print(torch.version.git_version[:7],end="")')
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_DOCKER_TAG}" \
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}"
ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
# Please note, here we ned to pin specific verison of CUDA as with latest label
if [[ ${CUDA_VERSION_SHORT} == "12.1" ]]; then
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}" \
ghcr.io/pytorch/pytorch-nightly:latest
docker push ghcr.io/pytorch/pytorch-nightly:latest
fi
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}-cu${CUDA_VERSION}" \
ghcr.io/pytorch/pytorch-nightly:latest
docker push ghcr.io/pytorch/pytorch-nightly:latest
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-aarch64-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-conda-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -222,6 +222,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
runs_on: linux.24xlarge
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -407,6 +470,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
runs_on: linux.24xlarge
build_name: conda-py3_9-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_9-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_9-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
build_name: conda-py3_9-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_9-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_9-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.9"
build_name: conda-py3_9-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_10-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -592,6 +718,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_10-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
runs_on: linux.24xlarge
build_name: conda-py3_10-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_10-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_10-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
build_name: conda-py3_10-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_10-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_10-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.10"
build_name: conda-py3_10-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_11-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -777,6 +966,69 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_11-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
runs_on: linux.24xlarge
build_name: conda-py3_11-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_11-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_11-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
build_name: conda-py3_11-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_11-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_11-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.11"
build_name: conda-py3_11-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_12-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -961,3 +1213,66 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_12-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
runs_on: linux.24xlarge
build_name: conda-py3_12-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_12-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_12-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
build_name: conda-py3_12-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_12-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_12-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.12"
build_name: conda-py3_12-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -229,7 +229,7 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm5_7-shared-with-deps-cxx11-abi-build:
libtorch-cuda12_4-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
@ -238,97 +238,56 @@ jobs:
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm5_7-shared-with-deps-cxx11-abi-test: # Testing
libtorch-cuda12_4-shared-with-deps-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm5_7-shared-with-deps-cxx11-abi-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/libtorch-cxx11-builder:rocm5.7-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm5_7-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm5_7-shared-with-deps-cxx11-abi-test
needs: libtorch-cuda12_4-shared-with-deps-cxx11-abi-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm5_7-shared-with-deps-cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_4-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
@ -440,3 +399,109 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_1-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
build_environment: linux-binary-libtorch-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_1-shared-with-deps-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm6_1-shared-with-deps-cxx11-abi-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/libtorch-cxx11-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_1-shared-with-deps-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_1-shared-with-deps-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_1-shared-with-deps-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-pre-cxx11-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -31,7 +31,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-pre-cxx11-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -229,7 +229,7 @@ jobs:
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm5_7-shared-with-deps-pre-cxx11-build:
libtorch-cuda12_4-shared-with-deps-pre-cxx11-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
@ -238,97 +238,56 @@ jobs:
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm5_7-shared-with-deps-pre-cxx11-test: # Testing
libtorch-cuda12_4-shared-with-deps-pre-cxx11-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm5_7-shared-with-deps-pre-cxx11-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm5.7-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm5_7-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm5_7-shared-with-deps-pre-cxx11-test
needs: libtorch-cuda12_4-shared-with-deps-pre-cxx11-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm5.7
GPU_ARCH_VERSION: 5.7
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm5.7-main
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm5_7-shared-with-deps-pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_4-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-pre-cxx11-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-cuda12_4-shared-with-deps-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
@ -440,3 +399,109 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_1-shared-with-deps-pre-cxx11-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
build_environment: linux-binary-libtorch-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_1-shared-with-deps-pre-cxx11-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-rocm6_1-shared-with-deps-pre-cxx11-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_1-shared-with-deps-pre-cxx11-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_1-shared-with-deps-pre-cxx11-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: pre-cxx11
build_name: libtorch-rocm6_1-shared-with-deps-pre-cxx11
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -26,7 +26,7 @@ env:
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

File diff suppressed because it is too large Load Diff

View File

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-conda
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-conda-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-libtorch-cxx11-abi
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-libtorch-cxx11-abi-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

View File

@ -26,7 +26,7 @@ env:
BUILD_ENVIRONMENT: macos-arm64-binary-wheel
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SKIP_ALL_TESTS: 1
SKIP_ALL_TESTS: 0
concurrency:
group: macos-arm64-binary-wheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

File diff suppressed because it is too large Load Diff

View File

@ -800,3 +800,260 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_4-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: libtorch-cuda12_4-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-cuda12_4-shared-with-deps-debug-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-cuda12_4-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-debug-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-debug-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
build_name: libtorch-cuda12_4-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -800,3 +800,260 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: libtorch-cuda12_4-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: libtorch-cuda12_4-shared-with-deps-release-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: libtorch-cuda12_4-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_4-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_4-shared-with-deps-release-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
build_name: libtorch-cuda12_4-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,40 @@
name: inductor-micro-benchmark
on:
schedule:
- cron: 0 7 * * *
push:
tags:
- ciflow/inductor-micro-benchmark/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor-micro-benchmark", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-test:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-micro-benchmark-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720

Some files were not shown because too many files have changed in this diff Show More